INDEX
Explanations
references to being imprisoned or held captive
references to prisoners and their experiences
Negative Logits
orp        -0.79
Boll       -0.75
OPA        -0.70
amera      -0.68
wig        -0.66
orie       -0.66
drive      -0.64
alore      -0.64
ulously    -0.64
ories      -0.63
Positive Logits
prisoners       1.01
prisoner        0.90
captives        0.88
inmates         0.87
sentenced       0.83
detainees       0.79
incarcerated    0.78
confinement     0.77
icts            0.73
captive         0.73
Activations Density: 0.025%