INDEX
    Explanations

    profane and derogatory terms

    derogatory terms and phrases related to unflattering behaviors or characteristics

    Negative Logits
    Token           Logit
    " Corm"         -0.64
    "Source"        -0.64
    "WER"           -0.63
    " Prophe"       -0.61
    " Novel"        -0.61
    " divisions"    -0.59
    "ãģ®ç"          -0.59
    "Feature"       -0.59
    "rics"          -0.57
    " Defender"     -0.57
    Positive Logits
    Token           Logit
    " jer"          1.17
    " jerk"         1.14
    "usalem"        1.03
    "weed"          0.86
    "offs"          0.82
    "boa"           0.82
    "ometer"        0.80
    "ety"           0.80
    "bucks"         0.80
    "itude"         0.79
    Activation Density: 0.021%

    No Known Activations