INDEX
    Explanations

    instances of insults and derogatory language

    Negative Logits
    orr       -0.07
    ilden     -0.07
    yles      -0.07
    ales      -0.07
    gie       -0.07
    erness    -0.07
    stral     -0.07
    ills      -0.07
    elp       -0.07
    over      -0.07

    Positive Logits
    ively      0.09
    ingly      0.09
    ably       0.08
    uous       0.08
    atory      0.08
    antly      0.07
    ive        0.07
    271        0.07
    urb        0.06
    acios      0.06
    Activation Density: 0.004%

    No Known Activations