INDEX
Explanations
mentions of derogatory terms or prejudiced language
references to racism and bigotry
Negative Logits
backup        -0.82
resil         -0.77
GOODMAN       -0.74
rotation      -0.74
refurb        -0.73
ufact         -0.73
refin         -0.73
oscopic       -0.70
remod         -0.69
renovations   -0.68
Positive Logits
bigot         1.03
bigotry       0.97
coward        0.92
Semitic       0.88
slurs         0.86
unworthy      0.85
perpetrated   0.85
cowardly      0.84
insin         0.80
hypocrisy     0.79
Activations Density 0.978%