INDEX
Explanations
rhetorical questions and expressions of surprise
Negative Logits
loth       -0.16
lost       -0.14
Hey        -0.14
Yap        -0.14
lak        -0.14
Heaven     -0.14
ups        -0.13
alling     -0.13
.HTML      -0.13
البل       -0.13
Positive Logits
WRONG      0.33
wrong      0.32
Wrong      0.28
Wrong      0.28
incorrect  0.28
wrong      0.27
ошиб       0.23
Incorrect  0.22
mistaken   0.22
incorrect  0.21
Activations Density 0.101%