INDEX
Explanations
keywords related to political contexts
references to politics
Negative Logits
Token      Logit
imates     -0.73
clud       -0.67
Vera       -0.66
actory     -0.66
untary     -0.63
Ratio      -0.62
Warrant    -0.61
Omni       -0.61
uran       -0.61
Cancel     -0.60
Positive Logits
Token        Logit
correctness  1.09
clinton      0.93
eering       0.92
manship      0.88
hip          0.83
chool        0.82
hare         0.81
politics     0.79
atism        0.78
esp          0.75
Activations Density: 0.063%