INDEX
Explanations
references to discrimination based on various criteria such as race, gender, sexual orientation, and physical characteristics
concepts related to discrimination and bias based on various personal characteristics
Negative Logits
Reviewer                            -0.76
Purg                                -0.66
metal                               -0.65
jet                                 -0.64
bage                                -0.63
Sunder                              -0.62
////////////////////////////////    -0.61
bj                                  -0.61
uckland                             -0.61
invoke                              -0.61
Positive Logits
ethnicity      1.12
nationality    1.11
gender         1.02
geography      0.92
colour         0.91
severity       0.88
likeness       0.87
resemblance    0.85
proximity      0.85
color          0.85
Activations Density: 0.296%