INDEX
    Explanations

    pronouns followed by actions

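    Negative Logits
    (token not shown)  0.60
    (token not shown)  0.59
    " on"  0.57
    "트는"  0.57
    (token not shown)  0.56
    "менте"  0.56
    "מ"  0.56
    (token not shown)  0.54
    (token not shown)  0.54
    "ಾನೆ"  0.54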
    Positive Logits
    "P"  0.61
    " idő"  0.59
    "o"  0.59
    "V"  0.56
    " blive"  0.55
    " alcanz"  0.53
    "و"  0.53
    "Khi"  0.52
    " julho"  0.52
    " każdy"  0.52
    Activation Density: 0.087%

    No Known Activations