Explanations
mathematical and logical structures or concepts
Negative Logits
ekim      -0.17
aidu      -0.16
ÃĹ↵↵      -0.16
kek       -0.15
artz      -0.15
cheon     -0.14
addy      -0.14
quam      -0.14
igham     -0.14
ãĥ¼ãĥĨ    -0.13
Positive Logits
Nat        0.25
nat        0.24
Nat        0.22
nat        0.20
destruct   0.19
induction  0.18
Tactics    0.18
tactic     0.17
wf         0.17
Wells      0.17
Activations Density: 0.003%