INDEX
Explanations
words or phrases related to politics
references to political concepts or discussions
Negative Logits
olen       -0.83
xt         -0.80
eret       -0.80
hem        -0.79
IER        -0.78
oning      -0.77
imus       -0.76
Cancel     -0.75
alin       -0.74
cellent    -0.72
Positive Logits
correctness    1.24
persuasion     1.04
affili         0.97
affiliation    0.97
activism       0.97
affairs        0.92
satire         0.91
fallout        0.91
partisans      0.91
ideology       0.90
Activations Density: 0.033%