    Negative Logits
    Token             Logit
    " policeman"      -0.07
    "نة"              -0.07
    " گن"             -0.07
    "ışman"           -0.07
    " دل"             -0.07
    "entral"          -0.07
    " dearly"         -0.06
    " `\"             -0.06
    " translator"     -0.06
    (not shown)       -0.06
    Positive Logits
    Token             Logit
    ".ASCII"          0.07
    (not shown)       0.06
    "ционных"         0.06
    "(memory"         0.06
    " Americ"         0.06
    " )↵"             0.06
    (not shown)       0.06
    " ов"             0.06
    (not shown)       0.06
    " optimism"       0.06
    Activation Density: 0.004%

    No Known Activations