    Negative Logits
    " dung"        -0.07
    " doom"        -0.07
    "]';↵"         -0.07
    " Lagos"       -0.07
    " She"         -0.07
    "े।↵"           -0.06
    " widget"      -0.06
    " باش"         -0.06
    " item"        -0.06
    " washing"     -0.06
    Positive Logits
    " अच"          0.07
    "cba"          0.06
    "نتی"          0.06
    " unclear"     0.06
    " innovative"  0.06
    "Andy"         0.06
    "utas"         0.06
    "Atlas"        0.05
    "งอย"           0.05
    " Ang"         0.05
    Activation density: 0.015%

    No Known Activations