INDEX
    Explanations
    NEGATIVE LOGITS
    Token      Logit
    "ن"        1.36
    "the"      1.33
    "н"        1.29
    (blank)    1.12
    "to"       1.10
    "ت"        1.01
    "ه"        0.99
    (blank)    0.98
    "ق"        0.95
    " was"     0.91
    POSITIVE LOGITS
    Token      Logit
    (blank)    0.96
    "行う"     0.89
    "З"        0.82
    "기를"     0.82
    "세요"     0.81
    "gång"     0.79
    "atures"   0.78
    "ഡ്"       0.77
    ".…"       0.77
    "g"        0.76
    Act Density 0.002%

    No Known Activations