INDEX
Explanations
phrases related to various actions and their consequences
words and phrases related to damage or consequences
Negative Logits
anus     -0.60
laus     -0.56
ª        -0.55
cknow    -0.53
giene    -0.52
OPLE     -0.51
Ħ¢       -0.50
rek      -0.49
TAIN     -0.49
ZI       -0.49
Positive Logits
differently        0.79
nicely             0.65
everywhere         0.62
lin                0.62
indistinguishable  0.58
whereas            0.58
anyways            0.57
beautifully        0.57
automatically      0.57
MUCH               0.56
Activations Density 1.195%