INDEX
Explanations
phrases related to controversies or political figures
instances of the word "us."
Negative Logits
ottest    -0.75
regor     -0.74
rought    -0.67
merce     -0.65
jriwal    -0.64
owler     -0.63
skirts    -0.62
ITNESS    -0.62
attery    -0.61
payoff    -0.60
Positive Logits
peed      1.03
pex       1.01
pecting   0.99
pect      0.97
sein      0.93
pects     0.93
cus       0.89
hee       0.88
aurus     0.86
cules     0.86
Activations Density: 0.030%