INDEX
Explanations
references to murder and related violent acts
Negative Logits
ваÑĢ        -0.17
Thief      -0.15
lage       -0.15
htt        -0.14
_corners   -0.14
nie        -0.14
906        -0.14
unes       -0.14
imate      -0.14
eward      -0.14
Positive Logits
ously      0.33
ous        0.28
abilia     0.25
esses      0.23
-su        0.21
mystery    0.20
spree      0.20
OUS        0.20
pedia      0.18
scene      0.18
Activations Density: 0.016%