INDEX
    Explanations
    NEGATIVE LOGITS
    Token         Value
    "u"           1.81
    "the"         1.75
    "ف"           1.60
    (not shown)   1.58
    "er"          1.45
    "ro"          1.40
    "ні"          1.33
    "ر"           1.31
    "f"           1.27
    "A"           1.27
    POSITIVE LOGITS
    Token         Value
    "ли"          1.43
    " "           1.13
    "Cuando"      1.11
    (not shown)   1.06
    " were"       1.05
    " أ"          1.02
    "ų"           1.00
    " działal"    0.99
    " étaient"    0.98
    "什么"        0.95
    Activation density: 0.003%

    No Known Activations