INDEX
    Explanations

    references to purity or pure concepts

    Negative Logits
    "shelf"      -0.19
    "perature"   -0.16
    "uality"     -0.15
    "itr"        -0.15
    "reuse"      -0.15
    "atical"     -0.15
    "sel"        -0.15
    "chy"        -0.15
    "mun"        -0.14
    "ÏĥÏĩ"       -0.14
    Positive Logits
    "bred"    0.34
    "pure"    0.30
    " Pure"   0.28
    " pure"   0.26
    "Pure"    0.26
    "foy"     0.25
    "st"      0.24
    "PURE"    0.23
    "ç²"      0.23
    "eing"    0.23
    Activation Density 0.015%

    No Known Activations