Explanations
references to victims of crimes or abuse
Negative Logits
enta         -0.17
aber         -0.17
apur         -0.16
rn           -0.15
iname        -0.15
azor         -0.14
-speaking    -0.14
sian         -0.14
aker         -0.14
ald          -0.14
Positive Logits
hood         0.17
friendly     0.16
ëĭ¹          0.16
ivors        0.16
úsqueda      0.15
änn          0.14
Friendly     0.14
innocent     0.14
Äħż          0.14
Zaman        0.14
Activations Density: 0.017%