INDEX
Explanations
names of people or entities
the word "likes" and its variations
Negative Logits
ŃĶ            -0.63
RELEASE       -0.60
INTON         -0.60
ccording      -0.60
exting        -0.57
unden         -0.56
gins          -0.55
english       -0.54
nonpartisan   -0.54
INA           -0.53
Positive Logits
of            1.05
liest         1.05
paces         0.96
wikipedia     0.87
lihood        0.79
Of            0.79
thereof       0.76
creen         0.74
ãĤ¯           0.73
mith          0.73
Activations Density: 0.024%