INDEX
    Explanations
    Negative Logits
    tridge      -0.16
    uur         -0.15
    ajo         -0.15
    andan       -0.14
    adoo        -0.14
     èĭ         -0.14
    iegel       -0.14
    ziel        -0.13
    ½           -0.13
    ãĥ³ãĤ¯      -0.13
    Positive Logits
    _WM         0.16
    à¤łà¤¨      0.16
    ılıp        0.14
     Golden     0.14
    ilingual    0.14
    à¤ł         0.14
    isors       0.14
    زة          0.14
    -threat     0.13
    kp          0.13
    Activation Density: 0.260%

    No Known Activations