    Explanations

    words related to negative or offensive behavior or comments

    derogatory terms and references to socially taboo activities

    Negative Logits
    Token           Logit
    " Plate"        -0.67
    "spring"        -0.67
    "Sources"       -0.65
    "Source"        -0.65
    " Source"       -0.62
    "ensus"         -0.61
    " Oral"         -0.60
    " Celtic"       -0.59
    " Sacrament"    -0.59
    " Novel"        -0.59

    Positive Logits
    Token           Logit
    " jer"          1.22
    " jerk"         1.13
    "etsk"          0.97
    "ithing"        0.89
    "bucks"         0.88
    "boa"           0.85
    "balls"         0.84
    ">>\"           0.84
    "artifacts"     0.83
    "EStream"       0.82

    Activation Density: 0.010%

    No Known Activations