INDEX
Explanations
pronouns referring to oneself or themselves
reflexive pronouns and phrases related to self-reference
Negative Logits
onal        -0.64
Mub         -0.63
cru         -0.61
iens        -0.60
aptic       -0.60
microsoft   -0.60
itty        -0.59
grade       -0.59
Alger       -0.58
emis        -0.57
Positive Logits
é¾įåĸļ士    0.75
ãĥĥãĥĪ      0.74
ãĥķ         0.73
ãĤĭ         0.71
ãģ¾         0.70
ãģı         0.69
è»          0.68
çīĪ         0.67
åĤ          0.66
ãĥĹ         0.65
Activations Density: 0.042%