    Explanations

    describing an intermediate level

    Negative Logits
    Token      Logit
    "1"        1.46
    "م"        1.36
    "א"        1.18
    "ן"        1.13
    " את"      1.10
    (blank)    1.09
    " in"      1.06
    " בה"      1.05
    " למ"      1.04
    "ır"       1.03
    Positive Logits
    Token          Logit
    "t"            1.37
    "ר"            1.33
    "н"            1.22
    (blank)        1.22
    "a"            1.20
    " moderate"    1.16
    "ن"            1.16
    "تي"           1.15
    "ت"            1.15
    "ر"            1.09
    Act Density 0.010%

    No Known Activations