INDEX
Explanations
words and phrases related to being rewarded for certain actions or behaviors
terms related to rewarding and punishing actions
Negative Logits
token     logit
frame     -0.75
space     -0.72
sie       -0.72
frames    -0.68
cell      -0.66
CON       -0.66
orig      -0.65
alter     -0.65
aug       -0.65
issues    -0.65
Positive Logits
token      logit
rewarded   1.54
rewarding  1.15
rewards    1.11
reward     1.08
tremend    0.98
nesday     0.98
incentiv   0.89
veter      0.87
reap       0.87
showc      0.87
Activations Density 0.015%