INDEX
Explanations
mentions of racism-related terms
terms related to race and racism
Negative Logits
tip        -0.71
Lily       -0.66
Dill       -0.66
Joint      -0.62
payload    -0.62
Vita       -0.62
Vera       -0.61
Patient    -0.61
lettuce    -0.60
EFF        -0.59
Positive Logits
rac        4.60
race       1.91
racist     1.82
Rac        1.77
Race       1.33
rag        1.20
racial     1.18
ran        1.17
rab        1.15
ras        1.15
Activations Density: 0.014%