INDEX
Explanations
references to racism and race-related topics
Negative Logits
es          -0.21
ez          -0.16
andez       -0.15
ecurity     -0.15
accord      -0.15
urt         -0.14
esen        -0.14
ncia        -0.14
íĿ¥         -0.14
ancing      -0.14
Positive Logits
coon        0.31
quet        0.28
rac         0.26
Rac         0.26
oon         0.25
oons        0.21
lette       0.21
rac         0.20
quete       0.20
etr         0.20
Activations Density 0.007%
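
The page does not say how its logit tables or density figure were computed. A minimal sketch of one common approach, under stated assumptions: the feature's decoder direction is projected through the model's unembedding matrix, the k lowest and k highest resulting values are reported as negative and positive logits, and the density is the fraction of tokens on which the feature fires. The names `decoder_dir`, `unembed`, `vocab`, `top_logit_effects`, and `activation_density` are hypothetical and do not come from the page.

```python
import numpy as np


def top_logit_effects(decoder_dir, unembed, vocab, k=10):
    """Return (negative, positive) lists of (token, logit effect) pairs.

    Assumption: decoder_dir has shape (d_model,) and unembed has shape
    (d_model, vocab_size); both are hypothetical stand-ins for model weights.
    """
    effects = decoder_dir @ unembed            # one value per vocabulary token
    order = np.argsort(effects)                # ascending by effect
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive


def activation_density(activations):
    """Fraction of tokens on which the feature is active (value > 0)."""
    activations = np.asarray(activations)
    return float(np.count_nonzero(activations > 0) / activations.size)


# Toy data only: random weights and a made-up vocabulary, not real model values.
rng = np.random.default_rng(0)
toy_vocab = [f"tok{i}" for i in range(50)]
toy_negative, toy_positive = top_logit_effects(
    rng.normal(size=8), rng.normal(size=(8, 50)), toy_vocab, k=5
)
```

Under this reading, a density of 0.007% would mean the feature is active on roughly 7 of every 100,000 tokens in the sampled text.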