    Negative Logits
    Token        Logit
    "2"          -0.07
    "id"         -0.07
    "mıyor"      -0.07
    " joyful"    -0.07
    "-rating"    -0.07
    "Freedom"    -0.06
    "yy"         -0.06
    " Gray"      -0.06
    "$d"         -0.06
    "348"        -0.06
    Positive Logits
    Token        Logit
    "ACH"        0.09
    "ch"         0.09
    "CH"         0.09
    "itch"       0.08
    "CHE"        0.08
    "atch"       0.08
    "arching"    0.08
    "ouch"       0.08
    "ach"        0.08
    "noch"       0.08
    Activation Density: 0.078%

    No Known Activations