    Explanations

    words that begin with the letter 'w'

    Negative Logits
    Token         Logit
    "Helpers"     -0.15
    "andest"      -0.15
    "lake"        -0.14
    "698"         -0.14
    "stras"       -0.14
    " Düz"        -0.14
    "ibar"        -0.14
    "ملة"         -0.14
    "arie"        -0.14
    "336"         -0.13
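    The Arabic token in the negative list is written in GPT-2's byte-level BPE alphabet as ÙħÙĦØ©. Each of those characters stands for one raw byte. Mapping them back to bytes and decoding the result as UTF-8 gives ملة. The sketch below assumes the model uses GPT-2's byte-level vocabulary, which this page does not state. It rebuilds the byte-to-character table the same way as the `bytes_to_unicode` helper in OpenAI's GPT-2 `encoder.py`. The name `decode_token` is hypothetical.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild GPT-2's reversible byte -> printable-character table."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


def decode_token(token: str) -> str:
    """Hypothetical helper: map displayed characters back to bytes, then decode UTF-8."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    return bytes(char_to_byte[ch] for ch in token).decode("utf-8")


print(decode_token("ÙħÙĦØ©"))  # ملة
print(repr(decode_token("Ġw")))  # ' w' (Ġ encodes a leading space)
```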
    Positive Logits
    Token         Logit
    " w"          0.22
    "ering"       0.19
    "ideo"        0.17
    "anj"         0.16
    "nder"        0.16
    "=w"          0.16
    " jam"        0.15
    "idd"         0.15
    "itten"       0.15
    "ingly"       0.15
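    Dashboards like this one commonly build their positive and negative logit lists in three steps. They project the feature's decoder direction onto the model's unembedding matrix. They then sort the resulting per-token scores and keep the top and bottom k tokens. This page does not confirm that method, so treat the following as a minimal sketch under that assumption. It also ignores the final LayerNorm and uses a toy 3-dimensional model with made-up weights. `top_logit_tokens`, `decoder_dir`, and `W_U` are hypothetical names.

```python
import numpy as np


def top_logit_tokens(decoder_dir, W_U, vocab, k=10):
    """Hypothetical helper: score each vocab token by decoder_dir @ W_U and
    return the k highest (positive) and k lowest (negative) (token, logit) pairs."""
    logits = decoder_dir @ W_U          # shape: (d_vocab,)
    order = np.argsort(logits)          # ascending: most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative


# Toy example: d_model = 3, vocabulary of 4 tokens (weights are invented).
vocab = [" w", "ering", "lake", "336"]
W_U = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
])
decoder_dir = np.array([1.0, 0.0, -1.0])

pos, neg = top_logit_tokens(decoder_dir, W_U, vocab, k=2)
print(pos)  # [(' w', 1.0), ('ering', 0.5)]
print(neg)  # [('lake', -1.0), ('336', -0.5)]
```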
    Activation Density: 0.025%
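    A density of 0.025% is 0.00025 as a fraction, which is about one firing per 4,000 tokens. The minimal sketch below assumes density means the fraction of sampled tokens on which the feature's activation is strictly positive. `activation_density` and the toy activation array are hypothetical.

```python
import numpy as np


def activation_density(acts, threshold=0.0):
    """Hypothetical helper: fraction of tokens whose activation exceeds `threshold`."""
    acts = np.asarray(acts)
    return float((acts > threshold).mean())


# Toy data: 4,000 tokens, and the feature fires on exactly one of them.
acts = np.zeros(4000)
acts[123] = 1.7
density = activation_density(acts)
print(f"{density:.3%}")  # 0.025%
```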

    No Known Activations