Explanations
references to evil and malevolent actions or characters
Negative Logits
etch     -0.17
acles    -0.16
icle     -0.16
пок      -0.16
ingly    -0.16
еÑĩ      -0.15
ĤŃ       -0.15
ech      -0.15
olina    -0.15
LIGHT    -0.14
Positive Logits
ness     0.18
ution    0.18
deeds    0.17
-do      0.16
lest     0.16
intent   0.15
UTION    0.15
ulence   0.15
nature   0.15
Bunny    0.15
Activations Density: 0.024%