INDEX
Explanations
phrases referencing existential questions and reasons for actions
Negative Logits
ve        -0.15
(AF       -0.15
');?>"    -0.14
rim       -0.14
Franklin  -0.14
Ø¢ÙĤ      -0.14
Laur      -0.14
¢         -0.14
.cli      -0.13
.idx      -0.13
Positive Logits
simply      0.25
random      0.22
reasons     0.20
inexp       0.20
nothing     0.20
randomly    0.19
reason      0.19
why         0.19
mysterious  0.18
arbitrary   0.18
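
Logit lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and sorting the vocabulary by the resulting score. The page does not say how its values were computed, so the following is only a minimal sketch under that assumption. The names `logit_effects` and `top_and_bottom` are hypothetical, and the two-dimensional weights are invented toy numbers, not values from any real model.

```python
def logit_effects(direction, unembedding, vocab):
    """Score every vocabulary token by the dot product of the feature
    direction with that token's unembedding row, highest score first.
    (Assumed method; the source does not specify it.)"""
    scores = [sum(d * w for d, w in zip(direction, row)) for row in unembedding]
    return sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)


def top_and_bottom(ranked, k):
    """Return the k most positive and the k most negative entries."""
    return ranked[:k], ranked[-k:][::-1]


# Toy inputs: invented weights for a 4-token vocabulary, d_model = 2.
vocab = ["simply", "reasons", "ve", "the"]
unembedding = [[0.25, 0.0], [0.2, 0.5], [-0.15, 0.1], [0.0, 0.3]]
direction = [1.0, 0.0]

ranked = logit_effects(direction, unembedding, vocab)
positive, negative = top_and_bottom(ranked, 1)
print("most positive:", positive)
print("most negative:", negative)
```

With these toy weights the most positive token is "simply" and the most negative is "ve", mirroring the ordering of the lists above without reproducing how they were actually derived.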
Activations Density 0.171%
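
Activation density is usually read as the fraction of token positions on which the feature fires at all. A minimal sketch of that calculation follows, assuming "fires" means an activation strictly above zero. The function name `activation_density`, the threshold, and the toy sample are all illustrative assumptions, not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds the threshold.
    (Assumed definition: a position counts as active when value > threshold.)"""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy sample: 100,000 token positions, 171 of them active.
toy = [0.0] * 99_829 + [1.3] * 171
density = activation_density(toy)
print(f"{density:.3%}")
```

A density this low means the feature is silent on the vast majority of tokens, which is typical of a sparse feature tied to a narrow set of phrases.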