INDEX
    NEGATIVE LOGITS
    (token missing)  -0.09
     Raj             -0.08
     Riy             -0.08
     stewardship     -0.08
    aissez           -0.07
     absolutely      -0.07
    hamb             -0.07
    hab              -0.07
     زي              -0.07
    habit            -0.07
    POSITIVE LOGITS
    coln             0.08
     svc             0.08
     Sox             0.08
     ylabel          0.08
    _PLAYER          0.08
     ustaw           0.08
    ror              0.07
     keeps           0.07
    sko              0.07
     הזה             0.07
    Activation Density: 0.053%

    No Known Activations