INDEX
    Explanations

    phrases related to offensive language or behavior

    references to offensive content or material

    Negative Logits
    Token           Logit
    "chell"         -0.92
    "perature"      -0.68
    "igrate"        -0.68
    "hett"          -0.67
    "population"    -0.67
    "plates"        -0.66
    "uther"         -0.66
    "ho"            -0.65
    "clerosis"      -0.65
    "ãĤ£"           -0.64
    Positive Logits
    Token           Logit
    "thouse"        0.73
    "bringer"       0.70
    " thrust"       0.67
    " Hebdo"        0.66
    "ments"         0.65
    " insensitive"  0.65
    " humour"       0.64
    " Cartoon"      0.64
    " Wilde"        0.63
    "ingly"         0.61
    Activation Density: 0.045%
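
The density figure is usually read as the share of token positions on which the feature fires at all. The page does not state how it was computed, so the following is only a minimal sketch under that assumption; the function name `activation_density_percent` and the sample data are hypothetical and not taken from the source.

```python
def activation_density_percent(activations):
    """Return the percentage of token positions with a nonzero activation.

    Assumption: "density" means the fraction of positions where the
    feature's activation is strictly positive. The source does not confirm this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > 0)
    return 100.0 * active / len(activations)


# Hypothetical sample: 45 active positions out of 100,000 tokens.
sample = [1.0] * 45 + [0.0] * 99_955
print(activation_density_percent(sample))
```

Under this reading, a density of 0.045% corresponds to roughly 45 active positions per 100,000 tokens.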

    No Known Activations