Explanations
phrases that express defiance against social norms and challenges to stereotypes
Negative Logits
ysl              -0.16
638              -0.15
Nobel            -0.15
Backing          -0.15
amerate          -0.14
訴               -0.13
आत               -0.13
armac            -0.13
íses             -0.12
constitutional   -0.12
Positive Logits
convention       0.38
conventions      0.35
conventional     0.33
established      0.33
accepted         0.31
norms            0.30
expectations     0.29
orth             0.28
Convention       0.28
establishment    0.27
Activations Density 0.260%
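
The density figure can be read as the share of sampled token positions on which the feature fires. The sketch below shows that calculation under stated assumptions: "fires" is taken to mean an activation strictly above a threshold of 0, and the function name `activation_density` and the sample data are hypothetical, not taken from the dashboard or from any particular library.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a position counts as active when its activation is strictly
    greater than `threshold`; the dashboard may define this differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 13 active positions out of 5000 tokens.
sample = [0.0] * 4987 + [1.5] * 13
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low means the feature is sparse: in this made-up sample it is silent on all but 13 of 5000 positions.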