INDEX
Explanations
words related to controversial or important topics
references to specific problems or challenges
Negative Logits
Token      Logit
urses      -0.85
zin        -0.84
bsite      -0.80
inav       -0.77
ondon      -0.77
alt        -0.76
thood      -0.74
ancies     -0.74
htaking    -0.73
ellow      -0.73
Positive Logits
Token          Logit
issue          0.99
Issue          0.89
naires         0.82
Issue          0.82
DonaldTrump    0.81
HRC            0.77
Issues         0.76
issues         0.73
plag           0.73
Iss            0.69
Activations Density 0.034%