INDEX
Explanations
terms related to political discourse and ideology
Negative Logits
roje     -0.16
ungal    -0.15
ode      -0.15
vez      -0.14
itler    -0.14
ead      -0.14
Pvt      -0.14
udge     -0.14
oje      -0.14
ansen    -0.14
Positive Logits
incorrect      0.19
-economic      0.19
-admin         0.18
Parties        0.17
Incorrect      0.17
/admin         0.17
incorrect      0.16
atform         0.15
Gerr           0.15
correctness    0.15
Activations Density: 0.041%