INDEX
Explanations
concepts related to mathematical or logical reasoning
Negative Logits
--      -0.91
--↵     -0.68
–       -0.63
---     -0.60
-       -0.57
âĢķ     -0.56
â       -0.52
--↵↵    -0.52
Â       -0.51
âĶĢ     -0.48
Positive Logits
"—      0.27
—is     0.24
—but    0.24
">-->↵  0.24
—are    0.23
)—      0.23
—"      0.23
—which  0.23
”—      0.23
—that   0.22
Activations Density 0.661%