    Explanations

    variations of the word "proof."

    Negative Logits
    ê»ĺ       -0.08
    unch      -0.07
    gi        -0.07
    -gnu      -0.07
    oral      -0.07
    lle       -0.07
    ughter    -0.07
    iser      -0.07
    alf       -0.07
    vre       -0.07
    Positive Logits
    reading   0.11
    lessly    0.09
    reader    0.07
    duc       0.06
    íıIJ       0.06
    -of       0.06
    read      0.06
    ing       0.06
    ahn       0.06
    /dis      0.06
    Activation Density: 0.015%

    No Known Activations