INDEX
Explanations
text related to physical instructions or steps
Negative Logits
!.           -0.77
%.           -0.69
$.           -0.66
,...         -0.65
+.           -0.64
*.           -0.63
although     -0.63
'.           -0.63
';           -0.61
HY           -0.61
Positive Logits
pires        0.89
depends      0.73
constitutes  0.72
entails      0.69
pired        0.69
involves     0.68
isn          0.66
mattered     0.63
varies       0.62
implies      0.62
Activations Density: 3.163%