INDEX
Explanations
words and phrases indicating significant changes or impacts
`revealed itself`, `different`, `whether we`, `typical sequences`
Negative Logits
Token       Logit
future      -0.31
↵           -0.27
deres       -0.27
ext         -0.25
先          -0.25
précieux    -0.25
futurs      -0.25
وند         -0.24
\           -0.24
next        -0.24
Positive Logits
Token        Logit
فريبيس       0.82
<pad>        0.80
<unused41>   0.80
<unused68>   0.80
<unused8>    0.80
[@BOS@]      0.80
<unused42>   0.79
<unused43>   0.79
<unused28>   0.79
<unused14>   0.79
Activations Density 0.134%