INDEX
Explanations
punctuation marks, specifically periods and exclamation points
Negative Logits
OKIE       -0.19
fashion    -0.17
Nor        -0.17
stdout     -0.15
esture     -0.15
-fashion   -0.15
okie       -0.15
.refs      -0.14
ouro       -0.14
Fashion    -0.14
Positive Logits
unkt         0.16
istrovství   0.15
uze          0.14
anko         0.14
afen         0.14
hire         0.13
grit         0.13
again        0.13
Bender       0.13
vyh          0.13
Activations Density: 0.001%