Explanations
derogatory language or remarks
terms related to racial slurs and derogatory language
Negative Logits
compan        -0.70
session       -0.68
angel         -0.66
NetMessage    -0.62
ocr           -0.61
growth        -0.61
reconc        -0.61
packing       -0.61
Folder        -0.61
Whe           -0.60
Positive Logits
slurs         1.43
slur          1.24
pees          0.80
plings        0.80
rimination    0.78
dispar        0.76
iple          0.74
ï¸ı           0.73
guiActiveUn   0.72
insults       0.71
Activations Density 0.013%