INDEX
    Explanations

    the concept of "reason" related to various explanations or justifications

    Negative Logits
    "gow"         -0.20
    "IRST"        -0.16
    "anzeigen"    -0.15
    "pery"        -0.15
    "/read"       -0.14
    "erson"       -0.14
    "nez"         -0.14
    "/lists"      -0.14
    "/run"        -0.14
    "æijĩ"        -0.14
    Positive Logits
    " why"        0.23
    "why"         0.20
    "nal"         0.18
    "lessly"      0.17
    "naires"      0.16
    "hift"        0.16
    "APPER"       0.16
    "üstü"        0.16
    " WHY"        0.15
    "ably"        0.15
    Activation Density: 0.039%

    No Known Activations