    Explanations

    instances of the word "reason" and its variations

    Negative Logits
    Token        Logit
    "cat"        -0.15
    "gow"        -0.14
    " keyed"     -0.14
    "kir"        -0.14
    " Margin"    -0.14
    "uye"        -0.14
    "omp"        -0.13
    "aur"        -0.13
    "/run"       -0.13
    "gia"        -0.13
    Positive Logits
    Token        Logit
    " why"       0.22
    "why"        0.19
    " dolayı"    0.16
    "nant"       0.16
    "nal"        0.16
    "upert"      0.16
    "lessly"     0.16
    "EO"         0.16
    "üstü"       0.16
    " WHY"       0.15
    Activation Density: 0.032%

    No Known Activations