INDEX
Explanations
negative or offensive language and comments
references to offensive comments and remarks
Negative Logits
Token     Logit
oglu      -0.74
negie     -0.73
prus      -0.72
sonian    -0.71
Luck      -0.69
UNCH      -0.69
iets      -0.69
inav      -0.69
aer       -0.68
Luck      -0.68
Positive Logits
Token              Logit
inappropriate      1.13
inappropriately    1.13
slurs              1.09
lewd               1.05
disrespectful      1.04
indecent           1.02
abusive            1.01
harassing          1.00
misogyn            0.95
uttered            0.94
Activations Density: 0.349%