INDEX
Explanations
phrases that indicate discrimination or biased judgments based on various criteria such as gender or race
Negative Logits
Token      Logit
uckland    -0.85
hiba       -0.74
jet        -0.70
along      -0.69
soon       -0.68
Edit       -0.67
Jet        -0.64
jam        -0.64
bats       -0.64
nin        -0.63
Positive Logits
Token        Logit
nationality  1.04
ethnicity    0.99
gender       0.90
sheer        0.90
resemblance  0.88
conscience   0.86
merit        0.86
whim         0.85
principles   0.85
disability   0.82
Activations Density 0.075%
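As a rough illustration of what the density figure could mean, the sketch below assumes it is the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; the page itself does not say how the number was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: the dashboard does not state how density is computed.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: the feature fires on 3 of 4000 token positions.
sample = [0.0] * 3997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.075%
```

Under this reading, a density of 0.075% means the feature fires on roughly 3 in every 4000 tokens, which fits a narrow feature that responds only to specific phrasing.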