INDEX
    Explanations

    instances of profanity and offensive language

    Negative Logits
    featureID          -0.96
    ]='\               -0.73
    TestBed            -0.69
    WithIOException    -0.68
     bershka           -0.66
    SOUNDBITE          -0.65
     iNdEx             -0.64
    __':               -0.63
    inguém             -0.62
     Moskva            -0.61
    Positive Logits
     swear       1.07
     swearing    1.04
     swears      1.00
     explicit    0.89
     vulgar      0.85
     prof        0.85
     swore       0.82
     language    0.81
     NSFW        0.79
     obscene     0.79
    Activation Density: 0.158%

    No Known Activations