INDEX
    Explanations

    negative sentiments and adverse outcomes

    harm, falsehoods, or errors

    Negative Logits
     GenerationType   -0.76
                      -0.62
     stage            -0.60
     BoxFit           -0.57
    :✨               -0.54
     special          -0.54
     uLocal           -0.52
     Stage            -0.51
     preside          -0.51
    seeds             -0.51
    Positive Logits
     Gewalt             0.43
    iestety             0.42
     ويكيپيديا          0.39
     śmier              0.37
     victimes           0.34
    KURZBESCHREIBUNG    0.33
    locaust             0.33
     Violence           0.33
     niestety           0.33
    toxicity            0.33
    Activation Density: 0.466%

    No Known Activations