INDEX
Explanations
questions and statements addressing the reader or listener directly
NEGATIVE LOGITS
yrights      -0.83
ces          -0.78
inery        -0.73
edIn         -0.72
opens        -0.72
ooks         -0.71
ges          -0.69
ruciating    -0.68
tails        -0.68
uces         -0.68
POSITIVE LOGITS
?'           1.06
?'"          1.04
ever         1.01
?"           0.98
?            0.95
?)           0.94
?:           0.92
...?         0.90
?!           0.86
????         0.86
Activations Density 0.513%