    Negative Logits
    Token       Value
    (missing)   1.24
    " is"       1.09
    "as"        1.02
    (missing)   0.97
    "on"        0.95
    " be"       0.91
    "h"         0.91
    "ا"         0.86
    "v"         0.83
    "我们"      0.82
    Positive Logits
    Token       Value
    "ین"        1.55
    "ینا"       1.22
    "ir"        1.20
    "ate"       1.16
    "ینگ"       1.09
    "ینس"       1.07
    "νες"       1.02
    "("         1.02
    "িতে"       1.01
    "یر"        1.01
    Act Density 0.012%

    No Known Activations