    NEGATIVE LOGITS
    ীয় -0.08
    මු -0.08
     Hedge -0.07
     가지 -0.07
    -0.07
    ীদ -0.07
     rejecting -0.07
     번째 -0.07
     taş -0.07
    대를 -0.07
    POSITIVE LOGITS
     improvement 0.18
     improvements 0.16
    改善 0.16
     Improvement 0.15
     개선 0.15
     સુધ 0.14
     Improvements 0.13
     improves 0.13
     worsening 0.13
     melhorias 0.13
    Act Density 0.062%

    No Known Activations