INDEX
    Explanations

    phrases indicating realizations or discoveries

    Negative Logits
    Token          Logit
    'I'            -0.16
    'seen'         -0.15
    'j'            -0.15
    ' seen'        -0.14
    ' anon'        -0.14
    'fik'          -0.14
    ' initial'     -0.14
    'でしょう'     -0.14
    'oci'          -0.14
    ' known'       -0.14

    Positive Logits
    Token          Logit
    '原来'         0.23
    ' actually'    0.20
    'actually'     0.20
    'Actually'     0.18
    ' Actually'    0.17
    ' indeed'      0.17
    'ičky'         0.16
    '竣'           0.16
    '@",'          0.15
    'agal'         0.14
    Act Density 0.238%

    No Known Activations
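
    Note on token display

Dashboard exports of this kind sometimes show non-ASCII tokens in the tokenizer's internal byte-level form (for example `åİŁæĿ¥` instead of `原来`). The sketch below shows one way to recover readable text, assuming the underlying tokenizer uses GPT-2-style byte-level BPE. The source does not name the model or tokenizer, so that is an assumption, and `decode_token` is a hypothetical helper name rather than part of any library.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2-style byte -> printable-character table (assumed scheme)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: display character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(tok: str) -> str:
    """Hypothetical helper: turn a byte-level token string into readable text."""
    # Tokens containing characters outside the table are assumed to be
    # already decoded, so they are returned unchanged.
    if any(ch not in UNICODE_TO_BYTE for ch in tok):
        return tok
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in tok)
    # A token may hold only part of a multi-byte character; replace, don't fail.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["åİŁæĿ¥", "ãģ§ãģĹãĤĩãģĨ", "iÄįky", "actually"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, the top positive-logit token decodes to the Chinese 原来 ("so it turns out"), which is consistent with the explanation "phrases indicating realizations or discoveries".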