INDEX
Explanations
vivid imagery and descriptions of violent or intense scenes
Negative Logits
sire       -0.16
winds      -0.15
åIJ        -0.15
Scre       -0.14
871        -0.14
(          -0.14
synd       -0.14
placement  -0.14
æł         -0.14
ughter     -0.14
Positive Logits
кап      0.17
ngu      0.16
LBL      0.16
like     0.15
against  0.15
dap      0.15
ottle    0.15
ermen    0.15
mote     0.14
506      0.14
Activations Density: 0.317%