INDEX
Explanations
racial and derogatory terms
racial slurs
Negative Logits
initial      -0.29
INITIAL      -0.26
Initial      -0.26
suitability  -0.25
mix          -0.25
line         -0.25
][           -0.24
?            -0.24
calon        -0.24
herself      -0.23
Positive Logits
ConstraintMaker  1.11
nigga            1.09
nigger           1.05
UserScript       0.98
AssemblyTitle    0.94
Nig              0.91
Nig              0.90
niggas           0.90
richTextPanel    0.87
findpost         0.84
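The page does not say how the logit lists are produced. A common approach for SAE feature dashboards, assumed here, is to project the feature's decoder direction onto each token's unembedding vector and then rank tokens by that direct logit effect: the highest scores form the positive list and the lowest form the negative list. The sketch below shows that ranking on a made-up 3-dimensional model with a hypothetical 4-token vocabulary. The names `logit_effects`, `top_and_bottom`, `decoder_dir` and `unembed` are illustrative only, and real pipelines may also fold in a final layer norm, which this sketch ignores.

```python
from typing import Dict, List, Tuple

def logit_effects(decoder_dir: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Direct logit effect per token: dot product of the (assumed) decoder
    direction with that token's unembedding vector. Layer norm is ignored."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec)) for tok, vec in unembed.items()}

def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative tokens, strongest first."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return positive, negative

# Hypothetical toy model: 3-d residual stream, 4-token vocabulary.
decoder_dir = [2.0, 0.0, -1.0]
unembed = {
    "alpha": [1.0, 0.0, 0.0],    # effect 2.0
    "beta": [0.0, 1.0, 0.0],     # effect 0.0
    "gamma": [0.0, 0.0, 1.0],    # effect -1.0
    "delta": [0.25, 0.0, -0.5],  # effect 0.5 + 0.5 = 1.0
}

positive, negative = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print("positive:", positive)
print("negative:", negative)
```

Because only the direct path from the feature to the output is measured, a token can rank highly here without the feature ever firing on it; that is one reason unrelated code identifiers can appear next to the slur tokens in a list like the one above.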
Activation density: 0.070%
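Activation density is usually read as the fraction of sampled tokens on which the feature fires. The sketch below assumes "fires" means an activation strictly above zero, which the page itself does not state. It uses a made-up sample of 10,000 activations with 7 nonzero values, chosen only to show what 0.070% corresponds to (about 7 tokens in every 10,000). The function name `activation_density` and its `threshold` parameter are hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation exceeds `threshold`.
    The default threshold of 0.0 is an assumption, not the dashboard's documented rule."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)

# Hypothetical sample: 9,993 silent tokens and 7 tokens where the feature fires.
acts = [0.0] * 9993 + [1.0] * 7
density = activation_density(acts)
print(f"{density:.3%}")  # formats the fraction as a percentage with 3 decimals
```

A density this low indicates a very sparse feature, consistent with a feature that responds only to a narrow set of terms.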