INDEX
Explanations
references to demographic topics or terms related to identity and social categorization
NEGATIVE LOGITS
Token      Logit
vet        -0.16
lied       -0.16
achs       -0.16
iously     -0.15
elib       -0.14
ogs        -0.14
度         -0.14
edback     -0.14
ificant    -0.14
bd         -0.14
POSITIVE LOGITS
Token      Logit
dem        0.23
Dem        0.21
Dem        0.18
dem        0.17
DEM        0.17
urge       0.16
meni       0.15
214        0.15
ographics  0.15
stration   0.15
Activations Density: 0.015%