Explanations
phrases related to rewards and motivations
Negative Logits
aug       -1.07
enium     -1.02
okin      -1.00
gow       -0.99
obiles    -0.97
uka       -0.96
enic      -0.95
ovies     -0.95
soType    -0.93
pac       -0.93
Positive Logits
rewarded         1.28
reward           1.26
rewards          1.14
reinforcement    1.05
payoff           1.02
reap             0.99
punishment       0.99
sanction         0.97
punish           0.96
accordingly      0.96
Activations Density 1.103%