INDEX
Explanations
references to political figures and institutions
Negative Logits
eyse     -0.16
ropoda   -0.16
idor     -0.16
utin     -0.16
ruba     -0.15
jure     -0.15
ochen    -0.15
ecure    -0.14
eldre    -0.14
æ¦ľ      -0.14
Positive Logits
who      0.62
whose    0.46
who      0.43
whom     0.35
whose    0.33
Who      0.33
with     0.32
Who      0.30
quien    0.29
qui      0.28
Activations Density 0.640%