Explanations
References to race and racial issues.
Negative Logits
racing        -0.16
Racing        -0.16
races         -0.16
.metamodel    -0.16
iram          -0.15
IRA           -0.14
Baz           -0.14
ares          -0.14
inition       -0.14
eling         -0.14
Positive Logits
profiling     0.26
/color        0.21
-prof         0.21
cleansing     0.21
Cleans        0.19
ized          0.19
profiler      0.18
pride         0.18
harmony       0.18
epith         0.18
Activation Density: 0.033%