Explanations
terms related to rewards and bonuses
Negative Logits
Token    Logit
ned      -0.71
Kone     -0.67
Heide    -0.66
jed      -0.65
auto     -0.65
mah      -0.59
Kines    -0.59
Sloan    -0.59
auto     -0.59
Fol      -0.58
Positive Logits
Token        Logit
reward       1.07
Reward       1.06
AndEndTag    0.99
Reward       0.98
reward       0.95
incentive    0.94
mixtures     0.92
mixture      0.90
rewards      0.90
incentives   0.89
Activations Density: 0.116%