INDEX
    Explanations

    various forms of offensive language and profanity

    Negative Logits
    dale        -0.15
    kinson      -0.15
    cli         -0.15
    âĦ          -0.15
    /plugin     -0.14
    yang        -0.14
    arden       -0.14
    ullo        -0.14
    zel         -0.14
    ernal       -0.14

    Positive Logits
    edd         0.14
    éĦ          0.13
    oten        0.13
     Reef       0.13
    YLE         0.13
    elen        0.13
     hur        0.13
    룡          0.13
    ihn         0.13
     Duy        0.13
    Activation Density: 0.019%

    No Known Activations