INDEX
Explanations
references to individuals' professions and nationalities
Negative Logits
auen        -0.17
artz        -0.15
eil         -0.15
avir        -0.15
ecurity     -0.15
ej          -0.15
Current     -0.15
agner       -0.15
313         -0.14
terra       -0.14
Positive Logits
states      0.31
polym       0.27
states      0.25
-states     0.25
bot         0.25
jur         0.23
politician  0.23
STATES      0.22
reform      0.22
States      0.21
Activations Density: 0.183%