INDEX
Explanations
words related to negative actions or consequences
Negative Logits
/umd      -0.18
esser     -0.16
GuidId    -0.15
kino      -0.14
alsy      -0.14
Souls     -0.14
ście      -0.14
assis     -0.14
vala      -0.14
蜘蛛      -0.14
Positive Logits
Club      0.23
Club      0.20
Frozen    0.19
Penguin   0.19
Snow      0.18
cp        0.18
frozen    0.18
sled      0.18
Snow      0.18
club      0.17
Activations Density 0.001%