INDEX
    Explanations

    comparisons emphasizing similarity and equivalence

    Negative Logits
    "letic"     -0.16
    "cow"       -0.16
    "onga"      -0.16
    "illac"     -0.16
    "iller"     -0.15
    " bas"      -0.15
    "リーズ"    -0.15
    "enor"      -0.14
    "ipop"      -0.14
    "्तन"       -0.14
    Positive Logits
    " they"      0.23
    " others"    0.19
    " she"       0.18
    " THEY"      0.18
    " we"        0.17
    " he"        0.17
    " manner"    0.17
    " did"       0.16
    "h"          0.16
    "they"       0.16
    Activation Density: 0.084%

    No Known Activations