Explanations
references to violence and brutality
Negative Logits
Token       Logit
Dangerous   -0.16
Danger      -0.15
dangerous   -0.15
Mint        -0.15
danger      -0.14
anes        -0.14
Danger      -0.14
chner       -0.14
menacing    -0.14
Paran       -0.14
Positive Logits
Token           Logit
dec             0.25
dissect         0.24
dis             0.22
hacked          0.20
viv             0.20
decomposition   0.20
decom           0.19
киÑĪ            0.19
mutil           0.19
scal            0.19
Activations Density: 0.223%