INDEX
Explanations
actions involving physical violence or aggression
Negative Logits
ặt      -0.15
unh     -0.15
unb     -0.15
igne    -0.15
ož      -0.14
ç©      -0.14
atik    -0.14
Bread   -0.14
Bever   -0.14
bread   -0.13
Positive Logits
holm    0.16
OE      0.14
uty     0.14
oste    0.14
ologie  0.14
ëį      0.14
uten    0.14
éĻ£     0.14
zel     0.14
onne    0.14
Activations Density 0.178%
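
The density figure above is not defined on this page. A minimal sketch of one common reading, assuming it is the fraction of token positions at which the feature's activation is nonzero, is given below. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    The page itself does not state how its 0.178% figure is computed.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, 18 of which have a nonzero activation.
sample = [0.0] * 9982 + [1.5] * 18
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.178% would mean the feature fires on roughly 18 of every 10,000 tokens, which is typical of a sparse feature.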