INDEX
    Explanations

    references to rewards or rewarding situations

    Negative Logits
    " Suzy"         -0.81
    "findpost"      -0.79
    " Spie"         -0.75
    "Odkazy"        -0.75
    " tasche"       -0.72
    " mcqueen"      -0.72
    " Iain"         -0.71
    " monks"        -0.68
    " isolado"      -0.68
    " Jha"          -0.68
    Positive Logits
    " Rewards"       1.23
    " reward"        1.19
    " rewards"       1.16
    " Reward"        1.14
    "Rewards"        1.12
    "rewards"        1.04
    "Reward"         1.00
    "reward"         0.94
    " rewarding"     0.81
    " rewarded"      0.78
    Act Density 0.003%

    No Known Activations