INDEX
    Explanations

    phrases that emphasize similarity or equivalence

    Negative Logits
    Token           Logit
    "vernment"      -0.91
    "heit"          -0.71
    "schild"        -0.70
    "ール"          -0.69
    "numbered"      -0.68
    " Supported"    -0.66
    "netflix"       -0.65
    "Interested"    -0.65
                    -0.64
    "senal"         -0.63
    Positive Logits
    Token           Logit
    " goes"         0.86
    " applies"      0.81
    " holds"        0.71
    " occurs"       0.69
    " happens"      0.68
    " cannot"       0.66
    " assumes"      0.65
    "stuff"         0.64
    " intuition"    0.64
    " accum"        0.64
    Act Density 0.014%

    No Known Activations