Explanations
phrases or words indicating a contrast or contradiction
phrases or statements that challenge popular beliefs or expectations
Negative Logits
oided     -0.86
ross      -0.71
estones   -0.70
azz       -0.70
BSD       -0.67
lov       -0.66
among     -0.65
CLA       -0.65
          -0.65
ROM       -0.63
Positive Logits
prevailing      0.78
stereotypical   0.78
stereotypes     0.78
stereotype      0.77
ptions          0.71
expectations    0.71
conventional    0.71
belie           0.70
belief          0.69
usual           0.68
Activations Density 0.129%
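
For readers unsure how to read a density figure like the one above, here is a minimal sketch. It assumes that "activation density" means the fraction of token positions on which the feature's activation is above zero; the source does not define the term, so that reading is an assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: density is "share of token positions where the feature fires".
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count positions where the feature is considered active.
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 1 of 1000 token positions.
sample = [0.0] * 999 + [2.5]
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.129% would correspond to the feature firing on roughly 129 of every 100,000 token positions.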