    Explanations

    phrases that introduce examples or instances

    Negative Logits
    "ija"        -0.14
    " indeed"    -0.14
    "para"       -0.14
    "ico"        -0.14
    "idi"        -0.14
    "iglia"      -0.14
    "aps"        -0.14
    "AT"         -0.13
    "edo"        -0.13
    "息"         -0.13
    Positive Logits
    " sake"      0.27
    " purposes"  0.24
    ":"          0.18
    ":↵"         0.16
    "えば"       0.16
    "而"         0.15
    "orz"        0.15
    " když"      0.15
    "来说"       0.14
    "forth"      0.14
    Act Density 0.032%
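As a rough illustration of the activation density figure above, the sketch below assumes the metric is the fraction of sampled tokens on which the feature fires (activation above zero). The function name `activation_density`, the zero threshold, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active; the dashboard's exact definition may differ.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 32 of which activate the feature.
sample = [0.0] * 99_968 + [1.5] * 32
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.032%
```

Under this reading, a density of 0.032% corresponds to roughly 32 active tokens per 100,000 sampled.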

    No Known Activations