INDEX
Explanations
pronouns and verbs describing actions or states of being
Negative Logits
umm                 -0.16
sage                -0.15
adla                -0.14
فور                 -0.14
alu                 -0.14
-demand             -0.13
longleftrightarrow  -0.13
wy                  -0.13
ura                 -0.13
tog                 -0.13
Positive Logits
esis                0.16
aunch               0.14
gings               0.14
endon               0.14
eri                 0.14
ーチ                0.14
ammer               0.14
Count               0.14
enen                0.14
erve                0.13
Activations Density: 0.009%