INDEX
Explanations
mentions of "devil" and related concepts
Negative Logits
iefs        -0.16
emean       -0.15
roz         -0.14
izzlies     -0.14
_TV         -0.14
rror        -0.14
983         -0.14
Lens        -0.14
Constantin  -0.14
exual       -0.13
Positive Logits
ishly       0.23
ry          0.21
ish         0.20
ution       0.19
ridge       0.18
/dev        0.18
ISH         0.18
UTION       0.16
bane        0.16
sd          0.16
Activations Density: 0.011%