    Explanations

    human rights violations and abuses

    Negative Logits
      Token          Logit
      " fury"        0.45
      "愤怒"         0.43   (Chinese: "anger")
      " hurting"     0.42
                     0.40
      " harm"        0.40
      " baddies"     0.40
      " traged"      0.39
      " اتھار"       0.39
      " harmed"      0.39
      " adversity"   0.39
    Positive Logits
      Token          Logit
      " arbitrary"   0.99
      " torture"     0.89
      " extra"       0.80
      " tort"        0.80
      "Tort"         0.78
      " Tort"        0.76
      "tort"         0.75
      " Arbit"       0.72
      " ekstra"      0.71
      " executions"  0.70
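    The logit lists rank vocabulary tokens by how strongly this feature pushes their logits up (positive) or down (negative). The sketch below shows one way such lists could be produced. It assumes, without confirmation from this page, that a token's score is the dot product of the feature's decoder direction with that token's unembedding column. `logit_effects` and `top_logit_tokens` are hypothetical helpers, the toy numbers are invented, and the page's own sign convention for the negative list may differ.

```python
# Hypothetical sketch: rank tokens by a feature's direct effect on the logits.
# Assumption (not confirmed by the dashboard): effect[t] = decoder_dir . W_U[:, t].


def logit_effects(decoder_dir: list[float], unembed: list[list[float]]) -> list[float]:
    """Project a feature's decoder direction (length d_model) through an
    unembedding matrix stored as d_model rows, each of length vocab_size."""
    if len(decoder_dir) != len(unembed):
        raise ValueError("decoder_dir length must equal the number of unembed rows")
    vocab_size = len(unembed[0])
    return [
        sum(w * row[t] for w, row in zip(decoder_dir, unembed))
        for t in range(vocab_size)
    ]


def top_logit_tokens(effects: list[float], vocab: list[str], k: int = 10):
    """Return (positive, negative): the k most-promoted tokens, largest first,
    and the k most-suppressed tokens, most negative first."""
    if len(effects) != len(vocab):
        raise ValueError("effects and vocab must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(vocab)), key=lambda i: effects[i])  # ascending by effect
    negative = [(vocab[i], effects[i]) for i in order[:k]]
    positive = [(vocab[i], effects[i]) for i in reversed(order[-k:])]
    return positive, negative


# Toy example: d_model = 2, vocabulary of 4 tokens (leading spaces are part of the token).
vocab = [" fury", " torture", " arbitrary", " harm"]
decoder_dir = [1.0, -1.0]
unembed = [
    [0.25, 0.5, 1.0, -0.25],
    [0.5, -0.5, -0.25, 0.25],
]
effects = logit_effects(decoder_dir, unembed)
positive, negative = top_logit_tokens(effects, vocab, k=2)
print(positive)  # [(' arbitrary', 1.25), (' torture', 1.0)]
print(negative)  # [(' harm', -0.5), (' fury', -0.25)]
```

    Sorting indices once and slicing both ends keeps the two lists consistent with each other. For a real vocabulary of tens of thousands of tokens, an array library's matrix product and partial sort would be the practical choice.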
    Activation Density: 0.011%
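    If activation density is read as the percentage of token positions on which the feature fires, 0.011% means roughly 11 of every 100,000 tokens. The sketch below computes density under that reading. The "activation above zero" threshold and the `activation_density` helper are assumptions for illustration, not the dashboard's documented definition.

```python
# Hypothetical sketch: activation density as the percentage of tokens on which
# a feature's activation exceeds a threshold (0 by default). The exact definition
# the dashboard uses is an assumption here.


def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation is strictly above threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data: the feature fires on 11 positions out of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(11):
    acts[i * 9_000] = 1.5
print(f"{activation_density(acts):.3f}%")  # 0.011%
```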

    No Known Activations