INDEX
Explanations
discriminatory language related to race, gender, sexual orientation, and religion
instances of discriminatory language or terms related to prejudice and bigotry
Negative Logits
Token        Logit
hower        -0.87
leaf         -0.82
pletion      -0.79
flix         -0.77
change       -0.76
pring        -0.75
ources       -0.75
imum         -0.75
ership       -0.73
aper         -0.72
Positive Logits
Token        Logit
slurs        1.48
stereotypes  1.02
prejudice    1.00
jokes        0.99
homophobic   0.97
tir          0.96
slur         0.96
bigot        0.95
sexist       0.94
prejud       0.94
Activations Density 0.077%