Explanations
words related to bigotry, discrimination, and negative attitudes towards certain groups or individuals
references to bigotry and related negative behaviors
Negative Logits
confir             -0.71
pring              -0.69
Cancel             -0.69
Accounting         -0.67
ournal             -0.67
OA                 -0.66
birth              -0.66
synchronization    -0.63
Contracts          -0.63
%"                 -0.63
Positive Logits
uously             0.86
ifiers             0.84
bigotry            0.82
etooth             0.80
bigot              0.80
ifying             0.78
ifiable            0.78
itude              0.77
izer               0.75
itious             0.75
Activations Density 0.013%