Explanations
terms related to malevolence or wickedness
Negative Logits
ylül      -0.17
posable   -0.17
eron      -0.17
laz       -0.16
лаж       -0.16
ÅĻeba     -0.15
egis      -0.14
tle       -0.14
çͲ        -0.14
ussen     -0.14
Positive Logits
ution       0.29
deeds       0.26
-do         0.26
ness        0.25
intent      0.23
intentions  0.21
ulence      0.20
intent      0.20
deed        0.20
intents     0.20
Activations Density: 0.026%