INDEX
Explanations
negative actions or attributes associated with dehumanizing, demonizing, stigmatizing, or vilifying individuals or groups
words related to social stigmatization and dehumanization
Negative Logits
UTERS      -0.69
oret       -0.62
negie      -0.60
enthusi    -0.59
uid        -0.59
INC        -0.58
cffff      -0.57
Prospect   -0.57
stead      -0.57
NH         -0.56
Positive Logits
slurs        0.92
imaru        0.90
stereotypes  0.88
stigma       0.80
prejudice    0.77
vil          0.77
insults      0.76
bullies      0.76
stigmat      0.75
dehuman      0.74
Activations Density 0.079%