INDEX
Explanations
countries, political figures, and controversial topics or actions
specific proper nouns and terms related to political and social issues
Negative Logits
ume         -0.66
Gil         -0.66
toile       -0.61
Nanto       -0.60
Ern         -0.60
Brune       -0.58
menstrual   -0.57
denomin     -0.55
colle       -0.55
backdrop    -0.55
Positive Logits
vantage     0.79
hadn        0.77
deserved    0.75
shouldn     0.74
should      0.71
bably       0.70
cheat       0.70
couldn      0.69
seless      0.69
helps       0.69
Activations Density: 0.612%