INDEX
    Negative Logits
    -placeholder    -0.07
     تكون    -0.07
    Acts    -0.07
    ในท    -0.07
     '>'    -0.07
     crippled    -0.06
     دهه    -0.06
     kolem    -0.06
    gado    -0.06
     Joan    -0.06
    Positive Logits
     Fur    0.09
    fur    0.07
     TRUE    0.07
     fur    0.07
    RIC    0.06
    [token not rendered]    0.06
     logo    0.06
    ursed    0.06
    [token not rendered]    0.06
     paw    0.06
    Act Density 0.003%

    No Known Activations