INDEX
    Explanations

    instances of strong profanity and derogatory language

    Negative Logits
    nodoc   -0.15
    sem     -0.15
    Ì       -0.14
    emens   -0.14
    -kit    -0.14
    ozem    -0.14
    812     -0.14
    zure    -0.14
    .diag   -0.14
    spec    -0.14
    Positive Logits
    abbo      0.16
    ason      0.15
    ppo       0.15
    endale    0.15
     Kidd     0.14
    mue       0.14
    eniable   0.14
     Carlson  0.14
    634       0.13
    erguson   0.13
    Activation Density: 0.027%

    No Known Activations