    Explanations

    contribution

    Negative Logits
    Token               Logit
    " contribution"     -2.33
    " Contribution"     -2.11
    "contribution"      -2.08
    " contributions"    -1.95
    "Contribution"      -1.89
    " contribute"       -1.88
    " CONTRIBUTION"     -1.84
    " Contribute"       -1.83
    " Contributions"    -1.82
    " Contributing"     -1.77
    Positive Logits
    Token               Logit
    " to"               1.18
    " with"             0.62
    " in"               0.59
    " for"              0.56
    " on"               0.54
    " by"               0.54
    " without"          0.54
    " via"              0.52
    " through"          0.51
    " made"             0.51
    Act Density 0.070%

    No Known Activations