INDEX
    Negative Logits
    " uncover"         -0.07
    " اختیار"          -0.07
    " soft"            -0.07
    "gorithms"         -0.07
    " eligibility"     -0.07
    "_pieces"          -0.06
    " reservations"    -0.06
    ".,"               -0.06
    " availability"    -0.06
    ".family"          -0.06
    Positive Logits
    " insulting"       0.12
    " insults"         0.12
    " insult"          0.11
    ".instagram"       0.07
    ".Disclaimer"      0.06
    " abusive"         0.06
    ".Interfaces"      0.06
    " humiliation"     0.06
    " thanking"        0.06
    " rebut"           0.06
    Activation Density: 0.006%

    No Known Activations