    Explanations

    terms related to harm, including physical, emotional, or potential danger

    mentions of harm, particularly in relation to various contexts and populations

    Negative Logits
    Token          Logit
    "umen"         -0.67
    "uren"         -0.65
    "ometer"       -0.64
    "READ"         -0.63
    "enhagen"      -0.63
    "Fan"          -0.61
    "aten"         -0.60
    "Completed"    -0.59
    " filled"      -0.59
    "Base"         -0.58
    Positive Logits
    Token          Logit
    " harm"        3.87
    " harms"       2.86
    " harmed"      2.15
    " Harm"        2.06
    "harm"         1.98
    " harming"     1.94
    " hurt"        1.65
    " damage"      1.61
    " harmful"     1.57
    " endanger"    1.53
    Activation Density: 0.021%
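
As a rough illustration of what the density figure above measures, the sketch below computes the fraction of token positions on which a feature is active. It assumes that "active" means an activation strictly greater than zero; the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard or its tooling.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" at a position when its
    activation is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (made up for illustration): 10,000 positions, 2 of them active.
toy_activations = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(toy_activations)
print(f"Activation density: {density:.3%}")
```

A density this low means the feature fires on only a small fraction of tokens, which fits a feature tied to one narrow concept such as harm.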

    No Known Activations