INDEX
    NEGATIVE LOGITS
    Token           Logit
    " Irr"          -0.09
    " Merry"        -0.08
    "baz"           -0.08
    " champion"     -0.08
    " יפה"          -0.08
    " Depression"   -0.08
    " Musa"         -0.08
    " Champion"     -0.08
    " elucid"       -0.08
    " החר"          -0.07
    POSITIVE LOGITS
    Token             Logit
    " أجل"            0.10
    " scratch"        0.10
    " امله"           0.09
    " thiện"          0.09
    " ҷониби"         0.08
    " مخې"            0.08
    " طریق"           0.08
    " afar"           0.08
    " perspectives"   0.08
    " خلالها"         0.08
    Activation Density: 0.132%

    No Known Activations