INDEX
    Explanations

    instances of actions or statements that could be seen as harmful or disruptive

    phrases that convey action or inquiry

    Negative Logits
    Token            Logit
    "nces"           -0.75
    "abouts"         -0.71
    "sequent"        -0.71
    "etheless"       -0.70
    "webkit"         -0.68
    "Keys"           -0.68
    " thereafter"    -0.66
    "tions"          -0.66
    "iann"           -0.65
    "none"           -0.62
    Positive Logits
    Token               Logit
    " wrong"            0.93
    " Wrong"            0.84
    " delusional"       0.81
    " something"        0.80
    " mischief"         0.80
    " miscon"           0.78
    " hypocr"           0.74
    " hay"              0.73
    " backwards"        0.72
    " misunderstood"    0.72
    Activation Density: 0.471%

    No Known Activations