INDEX
Explanations
terms related to political topics
references to politics and political issues
Negative Logits
uran        -0.90
actory      -0.88
imates      -0.80
ibles       -0.79
amination   -0.78
val         -0.75
reek        -0.74
orage       -0.74
eret        -0.71
gur         -0.69
Positive Logits
correctness   1.02
eering        0.92
manship       0.85
hare          0.81
cape          0.81
atism         0.77
governing     0.76
lawy          0.76
clinton       0.75
intrig        0.74
Activations Density: 0.018%