Explanations
references to specific people or events, especially related to announcements or declarations
references to specific individuals or events in sports and culture
Negative Logits
keyboards      -0.57
slur           -0.51
fuzz           -0.51
Occasionally   -0.51
thyroid        -0.47
tidy           -0.46
innocuous      -0.45
manic          -0.44
feminine       -0.44
stereotype     -0.44
Positive Logits
cember         0.63
OUP            0.60
ETHOD          0.59
numbered       0.58
DonaldTrump    0.57
HQ             0.55
itialized      0.54
DEM            0.54
razil          0.54
ģ«             0.53
Activations Density: 2.629%