Explanations
proper nouns, particularly names of individuals and titles
Negative Logits
Token          Logit
aju            -0.16
ined           -0.14
ses            -0.14
chaired        -0.14
ocs            -0.14
coincidence    -0.14
abr            -0.13
cha            -0.13
odel           -0.13
ij             -0.13
Positive Logits
Token          Logit
is             0.19
began          0.18
是一           0.17
unsch          0.17
isa            0.17
born           0.16
adalah         0.16
earned         0.16
became         0.16
是             0.16
Activations Density: 0.086%