INDEX
Explanations
mentions of politicians
Negative Logits
actory       -0.76
urious       -0.72
ventory      -0.70
uran         -0.69
lights       -0.66
Cancel       -0.66
gged         -0.65
east         -0.65
Condition    -0.65
IER          -0.64
Positive Logits
clinton      0.98
hips         0.80
appoint      0.77
icians       0.77
correctness  0.75
impe         0.71
hip          0.70
woman        0.69
bent         0.69
jriwal       0.67
Activations Density 0.029%