INDEX
Explanations
words related to political partisanship
references to partisan politics
Negative Logits
Token       Logit
ternally    -0.80
ofi         -0.80
worm        -0.79
uras        -0.78
ept         -0.77
enium       -0.76
uran        -0.76
ees         -0.75
worms       -0.74
ulet        -0.74
Positive Logits
Token         Logit
affiliation   0.93
partisans     0.90
partisan      0.89
affili        0.85
leaning       0.83
bias          0.81
politics      0.77
loyalty       0.77
persuasion    0.77
correctness   0.76
Activations Density 0.040%