    Explanations

    references to self-harm or violent acts directed at oneself

    references to suicide and related acts

    Negative Logits
    Token          Logit
    " Correct"     -0.72
    "parts"        -0.71
    "rium"         -0.71
    "heny"         -0.70
    " Provided"    -0.70
    "afort"        -0.69
    "uv"           -0.68
    " Phar"        -0.68
    "artisan"      -0.68
    "aunder"       -0.67
    Positive Logits
    Token          Logit
    " suicide"     1.31
    "zai"          1.10
    " bomber"      1.06
    " bombers"     1.00
    "icide"        0.98
    "icides"       0.93
    " suicides"    0.91
    "itating"      0.88
    "itated"       0.85
    " suicidal"    0.83
    Activation Density: 0.015%

    No Known Activations