INDEX
    Explanations
    NEGATIVE LOGITS
    "ل"            1.85
    "د"            1.85
    "ع"            1.60
    " في"          1.58
    "ر"            1.53
    "л"            1.49
    "خ"            1.49
    "ش"            1.46
    (not shown)    1.45
    "at"           1.43
    POSITIVE LOGITS
    " are"         1.66
    " were"        1.22
    " đều"         1.11
    "ate"          1.08
    " "            1.00
    " WERE"        0.98
    "-"            0.95
    (not shown)    0.95
    " ARE"         0.95
    "eli"          0.94
    Activation Density: 0.371%

    No Known Activations