INDEX
    Explanations

    incremental changes in values related to variables

    Negative Logits
    Token         Logit
    " Reward"     -0.14
    "imuth"       -0.14
    "ognitive"    -0.14
    "onio"        -0.13
    "aq"          -0.13
    "IRT"         -0.13
    "uw"          -0.13
    "oni"         -0.13
    "aal"         -0.13
    "isz"         -0.13

    Positive Logits
    Token         Logit
    "cript"       0.15
    "acebook"     0.14
    "ypi"         0.14
    "atus"        0.14
    " Until"      0.13
    "inz"         0.13
    " hasta"      0.13
    "ови"         0.13
    "Aspect"      0.13
    "water"       0.13

    Activation Density: 0.025%

    No Known Activations