INDEX
    Explanations

    phrases related to actions causing negative outcomes or harm

    instances of causation leading to negative outcomes

    Negative Logits
     Horizons       -0.61
    itar            -0.61
     demos          -0.60
    gaard           -0.60
     follow         -0.59
    oped            -0.59
     Transition     -0.58
     text           -0.58
    itarian         -0.57
     interviews     -0.57
    Positive Logits
     causing        3.32
     inflicting     1.97
     preventing     1.84
     harming        1.78
     injuring       1.77
     affecting      1.73
     disrupting     1.72
     ruining        1.71
     triggering     1.71
     provoking      1.70
    Act Density 0.025%

    No Known Activations