INDEX
Explanations
Terms related to social classifications and distinctions, especially those with a negative or critical connotation.
Negative Logits
Token           Logit
another         -0.16
Hra             -0.15
Hawth           -0.15
none            -0.14
with            -0.14
ươi             -0.14
twice           -0.14
as              -0.14
respectively    -0.14
many            -0.14
Positive Logits
Token           Logit
chal            0.20
-"              0.19
ilib            0.18
theless         0.17
-New            0.17
oner            0.16
ulas            0.16
(er             0.16
plib            0.16
atur            0.15
Activations Density 0.142%