INDEX
    Explanations

    references to attacks or aggressive actions

    Negative Logits
    Token        Logit
    "{~"         -0.76
    "Ав"         -0.69
    " whole"     -0.66
    " թվական"    -0.64
    "nO"         -0.64
    " gu"        -0.63
    " ύ"         -0.61
    " км"        -0.61
    "RUNTIME"    -0.61
    " Beans"     -0.60
    Positive Logits
    Token        Logit
    " Attack"    1.79
    " ATTACK"    1.69
    " attack"    1.66
    " attacks"   1.61
    " Attacks"   1.59
    "Attacks"    1.56
    "ATTACK"     1.55
    "attack"     1.55
    "Attack"     1.46
    "attacks"    1.46
    Act Density 0.067%

    No Known Activations