INDEX
    Explanations

    references to harm or risk associated with actions or situations

    Negative Logits
     viață
    -0.48
     cementerio
    -0.48
     cesse
    -0.48
    JMenuBar
    -0.46
    ẢN
    -0.46
     Varian
    -0.45
     gelassen
    -0.45
    ,:);
    -0.45
    -0.44
    رسال
    -0.43
    Positive Logits
     harmed
    1.03
     harming
    0.97
     harm
    0.93
     harms
    0.92
    harmed
    0.88
     hurting
    0.87
    脚注の使い方
    0.87
     Harm
    0.85
    harm
    0.85
     hurt
    0.82
    Act Density 0.347%

    No Known Activations