INDEX
Explanations
words related to opinions or actions being unpopular
references to unpopularity
Negative Logits
EStreamFrame  -0.81
erning        -0.78
ramid         -0.76
initely       -0.75
hens          -0.75
chn           -0.74
llular        -0.72
utics         -0.71
arnaev        -0.71
ynthesis      -0.70
Positive Logits
ity         1.19
unpopular   1.12
incumbent   0.94
ities       0.89
majorities  0.82
burdens     0.76
incumb      0.75
nesses      0.74
lihood      0.73
taboo       0.72
Activations Density: 0.021%