INDEX
    Explanations
    NEGATIVE LOGITS
     a    1.45
    ق    1.41
    1    1.33
     is    1.11
    a    1.06
    ad    1.05
    6    1.04
    H    1.00
    ade    0.99
    b    0.98
    POSITIVE LOGITS
     dwarf    1.01
     Dwarf    0.95
    скохозяй    0.89
    ř    0.88
    ної    0.84
    ского    0.83
     dwar    0.83
    يده    0.81
     lecz    0.81
    и    0.81
    Activation Density 0.004%

    No Known Activations