INDEX
Explanations
complex narratives surrounding moral dilemmas and criminal behavior
associated with evil
wickedness and evil acts
Negative Logits
refugi    -0.63
soñ       -0.57
hadas     -0.54
unanje    -0.53
miot      -0.53
disparu   -0.52
lių       -0.52
Format    -0.52
escase    -0.51
walde     -0.51
Positive Logits
evil         1.08
vile         1.04
vicious      0.99
diabo        0.96
wicked       0.95
cruel        0.95
depra        0.94
sinister     0.93
nef          0.91
despicable   0.89
Activations Density 0.509%