INDEX
    Explanations

    phrases related to causing harm or damage to others

    words or phrases that indicate negative impacts or harm to individuals or communities

    Negative Logits
    Token        Logit
    "uther"      -0.74
    "alled"      -0.74
    "aut"        -0.73
    "iliary"     -0.70
    "ult"        -0.70
    "ulum"       -0.68
    "au"         -0.68
    "ials"       -0.68
    "Nap"        -0.67
    "ault"       -0.67
    Positive Logits
    Token           Logit
    " hurting"      1.11
    " disadvant"    0.98
    " adolesc"      0.92
    " harming"      0.91
    " undermin"     0.85
    " badly"        0.85
    " horribly"     0.83
    "lehem"         0.82
    " Pwr"          0.81
    " harmed"       0.80
    Act Density 0.010%
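
As a rough illustration of the density figure above, the sketch below assumes that activation density means the fraction of tokens on which the feature's activation exceeds a threshold, reported as a percentage. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration; they are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition (not confirmed by the source): density is the
    share of tokens on which the feature fires at all.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 100,000 tokens, 10 of which activate the feature.
sample = [0.0] * 99_990 + [1.5] * 10
density = activation_density(sample)
print(f"Act Density {density * 100:.3f}%")
```

Under this assumed definition, a density of 0.010% corresponds to the feature firing on about 1 token in 10,000.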

    No Known Activations