    Explanations

    positive outcomes from inputs

    Negative Logits
    "rets"              0.38
    "versing"           0.37
    "占据"              0.36
    " provoca"          0.35
    "Differences"       0.35
    "失去了"            0.35
    "apshot"            0.35
    " esc"              0.35
    " introductions"    0.35
    " discouraging"     0.34
    POSITIVE LOGITS
    " benefitting"      2.00
    " benefiting"       1.95
    " benefit"          1.76
    " beneficio"        1.66
    " benefited"        1.62
    "benefit"           1.61
    " benefitted"       1.55
    " beneficia"        1.53
    "benef"             1.46
    " Benefit"          1.43
    Act Density 0.012%

    No Known Activations