INDEX
    Explanations

    phrases related to reasoning or causality

    Negative Logits
    wn      -0.74
    lez     -0.74
    ax      -0.70
    robe    -0.70
    agin    -0.69
    mint    -0.67
    nin     -0.66
    Gas     -0.64
    hal     -0.64
    age     -0.64
    Positive Logits
     they       1.09
     there      0.91
     nobody     0.90
     it         0.87
     THEY       0.85
     otherwise  0.83
     we         0.81
     unlike     0.80
    */(         0.78
     he         0.75
    Act Density 1.159%

    No Known Activations