    NEGATIVE LOGITS
    Token       Logit
    "é"         1.17
    "á"         1.16
    " on"       1.07
    " is"       1.05
    "sthe"      0.97
    " '"        0.95
    "]"         0.95
    "ethe"      0.93
    "."         0.93
    "ä"         0.91
    POSITIVE LOGITS
    Token       Logit
    "in"        1.48
    "ة"         1.26
    "ב"         1.24
    "at"        1.21
    "ور"        1.12
    "اك"        1.11
    "اا"        1.03
    "innt"      1.02
    " مي‌"       1.01
    " deprive"  0.98
    Activation Density: 0.000%

    No Known Activations