    Negative Logits
    " disturbances"    -0.07
    "-Man"             -0.07
    " Man"             -0.07
    "Running"          -0.07
    "_running"         -0.07
    " Running"         -0.07
    "-kind"            -0.07
    " disturbance"     -0.07
    "Alive"            -0.07
    " extent"          -0.07
    Positive Logits
    (token not captured)   0.17
    " refusing"        0.14
    ".reject"          0.14
    " refused"         0.14
    " rejects"         0.14
    " رفض"             0.14
    "Rejected"         0.14
    " rejected"        0.14
    " refusal"         0.14
    "reject"           0.14
    Activation Density: 0.045%

    No Known Activations
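
The density figure above can be read as the share of token positions on which the feature fires. The page does not define the metric, so the sketch below assumes the common reading: positions with activation above a threshold, divided by all positions. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to reproduce a value of the same size.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of active positions; the
    dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 45 active positions out of 100,000.
sample = [1.0] * 45 + [0.0] * 99_955
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.045%
```

Under this reading, a density of 0.045% corresponds to roughly 45 active positions per 100,000 tokens.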