Explanations
features related to language learning tools and user engagement
Negative Logits
resher      -0.15
пере        -0.14
regs        -0.14
atra        -0.14
егор        -0.14
ander       -0.13
uncture     -0.13
arta        -0.13
ぶ          -0.13
capacity    -0.13
Positive Logits
rewards      0.35
reward       0.33
progress     0.28
challenge    0.28
points       0.28
Rewards      0.27
earn         0.27
leaderboard  0.27
competition  0.26
rewarded     0.26
Activations Density 0.116%