    Explanations

    phrases discussing justifications or explanations for actions or beliefs

    Negative Logits
    Token       Logit
    "gow"       -0.17
    "uye"       -0.15
    "/run"      -0.15
    "кас"       -0.15
    "ay"        -0.14
    "achi"      -0.14
    " moy"      -0.14
    "uy"        -0.14
    "ipa"       -0.14
    "/read"     -0.13
    Positive Logits
    Token       Logit
    " why"      0.22
    "why"       0.19
    "üstü"      0.18
    "lessly"    0.17
    "hift"      0.17
    "nement"    0.16
    "APPER"     0.16
    " dolayı"   0.16
    "afort"     0.16
    "nal"       0.16
    Activation Density: 0.047%

    No Known Activations