Explanations
variable names like D, B, A, E
Negative Logits
Token   Value
aaa     1.05
cnc     1.04
xii     1.02
xiii    1.00
erc     1.00
xiv     0.98
tds     0.98
xvii    0.96
🔃      0.96
xvi     0.96
Positive Logits
Token   Value
B       2.06
C       2.02
B       2.02
E       1.99
E       1.99
C       1.96
F       1.94
D       1.94
G       1.92
D       1.92
Activations Density: 0.613%