Explanations
mentions of personal identities and expressions of disagreement
Negative Logits
Token         Logit
ABE           -0.75
fodder        -0.70
PBS           -0.69
democracy     -0.65
Skydragon     -0.65
ballots       -0.64
tiers         -0.63
ROS           -0.62
democracies   -0.62
hydrogen      -0.61
Positive Logits
Token         Logit
m             1.26
mean          1.20
t             1.16
want          1.12
mad           1.11
ll            1.09
ma            1.08
never         1.06
felt          1.06
say           1.05
Activations Density: 0.079%