Explanations
phrases indicating expectation or surprise
Negative Logits
esub       -0.18
adients    -0.17
ê¼         -0.15
oir        -0.15
gon        -0.15
ghi        -0.15
.opens     -0.15
à¥įदर     -0.14
-League    -0.14
okable     -0.14
Positive Logits
comes      0.41
come       0.39
Come       0.34
come       0.33
comes      0.32
Come       0.31
came       0.28
Comes      0.28
来         0.26
來         0.24
Activations Density 0.019%