INDEX
Explanations
terms and concepts related to different types of abuse and violence
Negative Logits
rief   -0.20
omb    -0.16
rei    -0.16
edia   -0.15
lify   -0.15
vise   -0.15
aris   -0.15
laş    -0.15
ots    -0.14
enes   -0.14
Positive Logits
ini               0.16
ulent             0.16
iveness           0.16
INI               0.15
734               0.15
/man              0.15
uous              0.15
227               0.14
ManagerInterface  0.14
δα                0.14
Activations Density 0.031%