INDEX
    Explanations

    phrases indicating different perspectives or ways of thinking

    Negative Logits
    "nad"      -0.19
    "lsru"     -0.16
    "ģn"       -0.15
    "ën"       -0.15
    "βε"       -0.14
    "jac"      -0.14
    "ogan"     -0.14
    "elda"     -0.14
    " alors"   -0.14
    "lette"    -0.13
    Positive Logits
    "ward"       0.17
    "oins"       0.15
    "fv"         0.15
    " cand"      0.15
    " sigmoid"   0.14
    "isphere"    0.14
    "LTR"        0.14
    " Russo"     0.14
    "inton"      0.14
    " wij"       0.14
    Activation Density: 0.032%

    No Known Activations