Explanations
references to physical violence such as assault and mugging
occurrences of the word "mug" and related references
Negative Logits
ISION       -0.77
edient      -0.72
Virgin      -0.70
Domin       -0.69
×Ļ×         -0.68
ISE         -0.68
IGH         -0.67
Doctrine    -0.66
ipher       -0.66
cision      -0.64
Positive Logits
mug         1.12
gers        1.11
shots       1.06
ging        1.02
shot        0.98
ger         0.96
atures      0.92
ged         0.88
glers       0.86
gery        0.84
Activations Density 0.007%
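
The density figure above can be read as the fraction of sampled tokens on which the feature fires. The sketch below illustrates that arithmetic under stated assumptions: the function name activation_density, the zero firing threshold, and the sample data (7 firing tokens out of 100,000) are all hypothetical and chosen only so the result matches 0.007%. The page itself does not state how the value was computed or how many tokens were sampled.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a feature counts as "firing" on a token when its
    activation is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 7 firing tokens among 100,000 tokens in total.
acts = [0.0] * 99_993 + [1.5] * 7
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.007%
```

A density this low means the feature is very sparse, which fits a feature tied to a narrow set of tokens such as "mug" and its continuations.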