INDEX
    Explanations

    references to physical harm or damage

    Negative Logits
    "liam"            -0.79
    "atari"           -0.77
    " Democr"         -0.69
    "ugi"             -0.65
    " seldom"         -0.64
    " unden"          -0.62
    " Millenn"        -0.61
    " ende"           -0.61
    "zhen"            -0.60
    " perpetually"    -0.60

    Positive Logits
    " nor"             1.18
    " wrongdoing"      1.07
    " harmed"          1.03
    " anything"        1.02
    " anybody"         0.93
    " whatsoever"      0.91
    "threatening"      0.89
    "anything"         0.89
    " any"             0.88
    " anyone"          0.87
    Activation Density: 0.360%

    No Known Activations