    Explanations

    reasoning or explanation related to decisions

    Negative Logits
    lez     -0.84
    wn      -0.77
    nin     -0.71
    agin    -0.70
    Gas     -0.68
    iac     -0.67
    hal     -0.66
    ax      -0.66
    gall    -0.65
    yan     -0.64
    Positive Logits
     they         1.01
     otherwise    0.92
     nobody       0.87
     unlike       0.83
     there        0.82
    */(           0.81
     it           0.81
     THEY         0.79
     obviously    0.71
     we           0.70
    Activation Density: 2.933%

    No Known Activations