INDEX
Explanations
pronouns and verbs, indicating personal connections or actions
Negative Logits
jang      -0.17
          -0.15
zilla     -0.15
kud       -0.14
CONTR     -0.14
GFX       -0.14
मर        -0.14
arnings   -0.14
assen     -0.13
令        -0.13
Positive Logits
Notice    0.17
happens   0.16
notice    0.16
Fisher    0.16
ove       0.15
ย         0.15
イク      0.15
endor     0.15
happen    0.15
ja        0.15
Activations Density 0.009%
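
The density figure can be read as the share of sampled tokens on which the feature fires at all. The sketch below assumes that definition (activation strictly above a threshold, here zero); the function name, the threshold parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens where the feature is active.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 9 of 100,000 tokens.
sample = [1.0] * 9 + [0.0] * 99_991
density = activation_density(sample)
print(f"{density:.3%}")  # 0.009%
```

Under this reading, a density of 0.009% corresponds to roughly 9 active tokens per 100,000, which marks the feature as very sparse.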