INDEX
    Explanations

    words related to causing harm or negative consequences

    Negative Logits
    Token          Logit
    "ARCH"         -0.70
    " liner"       -0.65
    " Pione"       -0.64
    " handy"       -0.63
    "ipel"         -0.63
    " Seasons"     -0.60
    "ourn"         -0.60
    "Jer"          -0.59
    "uncture"      -0.59
    "arity"        -0.59

    Positive Logits
    Token          Logit
    " harm"        1.31
    "onies"        1.25
    "lessly"       1.20
    " harms"       1.06
    "lessness"     0.92
    " harming"     0.86
    " Harm"        0.85
    " endanger"    0.83
    " harmed"      0.83
    "espie"        0.81

    Act Density 0.008%

    No Known Activations