Explanations
instances of profanity and offensive language
Negative Logits
featureID          -0.96
]='\               -0.73
TestBed            -0.69
WithIOException    -0.68
bershka            -0.66
SOUNDBITE          -0.65
iNdEx              -0.64
__':               -0.63
inguém             -0.62
Moskva             -0.61
Positive Logits
swear       1.07
swearing    1.04
swears      1.00
explicit    0.89
vulgar      0.85
prof        0.85
swore       0.82
language    0.81
NSFW        0.79
obscene     0.79
Activations Density: 0.158%