INDEX
Explanations
phrases related to offensive language or behavior
references to offensive content or material
Negative Logits
chell         -0.92
perature      -0.68
igrate        -0.68
hett          -0.67
population    -0.67
plates        -0.66
uther         -0.66
ho            -0.65
clerosis      -0.65
ãĤ£           -0.64
Positive Logits
thouse        0.73
bringer       0.70
thrust        0.67
Hebdo         0.66
ments         0.65
insensitive   0.65
humour        0.64
Cartoon       0.64
Wilde         0.63
ingly         0.61
Activations Density: 0.045%