    Explanations

    references to suicide and self-harm

    Negative Logits
    Token          Logit
    " original"    -0.55
    " Den"         -0.55
    " ("           -0.53
    " den"         -0.53
    "hi"           -0.51
    "den"          -0.50
    "<eos>"        -0.49
    " miss"        -0.48
    "↵↵"           -0.48
    ","            -0.48
    Positive Logits
    Token          Logit
    " suicide"     1.87
    " suicides"    1.67
    "suicide"      1.66
    "Suicide"      1.57
    " Suicide"     1.55
    " suicidio"    1.47
    " suic"        1.30
    " suicidal"    1.24
    "自殺"         1.15
    " myſelf"      1.04
    Activation Density: 0.155%

    No Known Activations