INDEX
    Explanations

    negative or harmful language, such as derogatory or inappropriate terms

    terms related to negative or harmful content and behavior

    Negative Logits
    Token          Logit
    "tested"       -0.86
    "rage"         -0.86
    "emis"         -0.82
    "hung"         -0.81
    "united"       -0.81
    "wright"       -0.80
    "lite"         -0.80
    "abiding"      -0.79
    "winning"      -0.78
    "bender"       -0.78

    Positive Logits
    Token            Logit
    " behavior"      1.13
    " materials"     1.13
    " material"      1.12
    " behaviour"     1.12
    " activities"    1.11
    " activity"      1.10
    " conduct"       1.06
    " behaviors"     1.06
    " situations"    1.03
    " items"         1.02
    Act Density 0.267%

    No Known Activations