    Explanations

    mentions of potential risks or negative consequences

    Negative Logits
    Token             Logit
    " admirable"      -0.65
    "ãģ®é"            -0.64
    "ãģ®éŃĶ"          -0.61
    "Avg"             -0.61
    "çͰ"             -0.61
    "ellen"           -0.59
    " courage"        -0.58
    "inho"            -0.58
    "hest"            -0.58
    "aples"           -0.57
    Positive Logits
    Token             Logit
    " someday"        1.02
    " repercussions"  0.96
    "urrence"         0.91
    " retribution"    0.88
    "angering"        0.85
    " relapse"        0.85
    " apocalypse"     0.83
    " contag"         0.81
    " repr"           0.81
    " poisoning"      0.80
    Activation Density: 0.350%

    No Known Activations