INDEX
Explanations
statements that point out inconsistencies or logical fallacies
Negative Logits
yte                -0.16
someday            -0.14
perhaps            -0.14
Perhaps            -0.14
çe                 -0.14
interesting        -0.13
_:*                -0.13
recommendations    -0.13
ĥn                 -0.13
somewhat           -0.13
Positive Logits
Impossible         0.21
unlikely           0.19
ãģ¯ãģļ             0.19
wouldn             0.17
unlikely           0.17
timeline           0.17
unless             0.17
Impossible         0.17
would              0.17
imposs             0.16
Activations Density: 0.211%