INDEX
Explanations
references to rewards or rewarding situations
Negative Logits
Suzy        -0.81
findpost    -0.79
Spie        -0.75
Odkazy      -0.75
tasche      -0.72
mcqueen     -0.72
Iain        -0.71
monks       -0.68
isolado     -0.68
Jha         -0.68
Positive Logits
Rewards      1.23
reward       1.19
rewards      1.16
Reward       1.14
Rewards      1.12
rewards      1.04
Reward       1.00
reward       0.94
rewarding    0.81
rewarded     0.78
Activations Density 0.003%