Explanations
words related to discrimination, bias, and prejudice
terms associated with extreme or derogatory labeling and bias
Negative Logits
aird       -0.78
stellar    -0.73
stable     -0.72
frames     -0.70
Indigo     -0.69
quart      -0.68
Sync       -0.67
erald      -0.67
Chrys      -0.66
oglobin    -0.65
Positive Logits
tactics         1.06
intimidation    1.01
blackmail       1.00
extortion       0.90
spying          0.89
perpetrated     0.89
retaliation     0.88
accusations     0.88
abuses          0.87
threats         0.87
Activations Density: 0.283%