INDEX
Explanations
references to murder and related violent acts
Negative Logits
nie         -0.17
Thief       -0.16
вар         -0.15
imate       -0.15
thora       -0.15
htt         -0.14
/buttons    -0.14
stract      -0.14
IMUM        -0.14
_corners    -0.14
Positive Logits
ously       0.32
abilia      0.25
esses       0.25
ous         0.25
-su         0.23
mystery     0.21
ess         0.21
OUS         0.20
attempt     0.18
plot        0.18
Activations Density 0.023%