Explanations
instances of the word "wrong" and related expressions indicating mistakes or moral failings
Negative Logits
.qual     -0.16
اÙĨÙĪ     -0.15
anki      -0.15
lesi      -0.15
arters    -0.15
lsa       -0.15
illet     -0.15
/cli      -0.15
apa       -0.14
cla       -0.14
Positive Logits
headed    0.43
fully     0.40
-headed   0.37
/right    0.30
er        0.30
ed        0.30
wrong     0.29
wrong     0.28
eous      0.27
est       0.27
Activations Density 0.066%