INDEX
Explanations
words related to bias, discrimination, and negative preconceived notions about certain groups of people
terms related to prejudice and discrimination
Negative Logits
Interstitial  -0.85
VID           -0.81
sis           -0.76
adra          -0.75
avez          -0.75
ascus         -0.74
incinn        -0.73
ODE           -0.72
irgin         -0.70
ramid         -0.70
Positive Logits
prejudice     1.28
prejud        1.20
prejudices    0.95
eering        0.83
icial         0.78
hatred        0.78
intolerance   0.74
ophobic       0.74
wart          0.73
towards       0.70
Activations Density 0.011%