INDEX
Explanations
pronouns indicating self-directed actions, particularly emphasizing belief in oneself
Negative Logits
Mub        -0.63
microsoft  -0.62
onal       -0.62
Nou        -0.62
Lens       -0.60
itty       -0.59
ency       -0.58
cru        -0.57
Alger      -0.57
grade      -0.57
Positive Logits
ortium     0.71
ãģ¾        0.71
ãģı        0.70
ãĤĭ        0.70
ãĥķ        0.70
é¾įåĸļ士   0.70
sanct      0.68
ãĥĥãĥĪ     0.67
ãģį        0.66
çīĪ        0.66
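Several of the positive-logit tokens look like raw vocabulary strings from a byte-level BPE tokenizer, not readable text. The sketch below assumes a GPT-2-style byte-to-character table (an assumption; the page does not name the tokenizer) and reverses it to recover the underlying UTF-8 text. The names `bytes_to_unicode`, `CHAR_TO_BYTE` and `decode_token` are illustrative and not part of any dashboard API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2-style byte -> printable-character table (assumed)."""
    # Byte values that are printable and therefore map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}


def decode_token(token):
    """Turn a byte-level vocabulary string back into readable text."""
    raw = bytearray()
    for ch in token:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        else:
            # Character is already real text (not in the table): keep its bytes.
            raw.extend(ch.encode("utf-8"))
    # Tokens may be partial characters, so replace invalid sequences.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ortium", "ãģ¾", "ãĥĥãĥĪ", "çīĪ", "sanct"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, `ãģ¾` decodes to `ま`, `ãĥĥãĥĪ` to `ット` and `çīĪ` to `版`, while ASCII tokens such as `ortium` and `sanct` pass through unchanged. If the feature belongs to a model with a different tokenizer, the table would need to be replaced accordingly.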
Activations Density 0.040%