INDEX
Explanations
features or characteristics associated with individuals or groups
words or phrases related to labels and stereotypes
Negative Logits
large     -0.81
usable    -0.74
ij        -0.73
range     -0.73
oho       -0.71
english   -0.70
ecause    -0.69
ensive    -0.69
CLOSE     -0.68
angan     -0.68
Positive Logits
extraord  1.18
gery      1.05
esses     1.04
hood      1.03
doms      1.01
ry        0.95
hordes    0.94
dom       0.92
isms      0.91
ism       0.90
Activations Density: 0.350%