INDEX
Explanations
proper nouns, particularly names of organizations and media sources
Negative Logits
Token      Logit
s          -0.15
lou        -0.14
end        -0.14
oub        -0.14
aper       -0.14
rage       -0.14
.__        -0.14
онÑĮ       -0.14
anking     -0.14
bol        -0.13
Positive Logits
Token        Logit
enumerator   0.16
ails         0.15
anza         0.15
_alias       0.15
ÎŃαÏĤ        0.14
-wsj         0.14
plit         0.14
éļª          0.14
izard        0.14
HIR          0.14
Activations Density 0.019%