    Explanations

    proper names followed by pronouns

    NEGATIVE LOGITS
    "↵↵"      0.84
    "3"       0.79
    "تون"     0.77
    "6"       0.77
    "8"       0.75
    "7"       0.68
    "ból"     0.66
              0.66
    "4"       0.65
    "٣"       0.65
    POSITIVE LOGITS
    " and"    1.06
    " in"     0.99
    " be"     0.93
    "ید"      0.90
    "ish"     0.89
    " I"      0.88
    " an"     0.85
    "ö"       0.82
    "ale"     0.77
    "ist"     0.76
    Act Density 0.001%

    No Known Activations