INDEX
Explanations
expressions of moral judgment or wrongdoing
Negative Logits
antar    -0.18
apa      -0.17
лами     -0.16
انو      -0.16
uai      -0.16
wine     -0.16
anki     -0.15
traits   -0.15
coni     -0.15
.qual    -0.15
Positive Logits
fully     0.33
headed    0.31
s         0.26
/right    0.26
wrong     0.25
-headed   0.25
wrong     0.23
WRONG     0.23
Wrong     0.21
Wrong     0.21
Activations Density 0.050%
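
As a rough illustration of the density figure above, the sketch below assumes that "Activations Density" means the percentage of sampled tokens on which the feature fires at all. That definition, the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the page does not state how the number was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumed definition: density = 100 * (active tokens) / (all tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 5 of 10,000 tokens.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.7, 2.2]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.050% would correspond to the feature activating on roughly 5 of every 10,000 tokens.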