Explanations
the word "time", and sometimes "heat"
NEGATIVE LOGITS
<eos>     -0.94
.         -0.91
↵         -0.89
the       -0.84
↵↵        -0.82
"         -0.80
“         -0.77
,         -0.77
a         -0.75
to        -0.74
POSITIVE LOGITS
Efq        1.70
Monfieur   1.63
Theſe      1.61
Reſ        1.59
myſelf     1.58
iſt        1.49
Anſ        1.46
pleaſure   1.43
Houſe      1.38
Jefus      1.38
Activation density: 0.282%