INDEX
Explanations
transitions indicating rephrased explanations or clarifications
Negative Logits
ucht      -0.14
ntag      -0.14
ubu       -0.14
ujet      -0.14
xin       -0.14
ileen     -0.14
lez       -0.14
ÏĥÏĦή     -0.14
atsu      -0.13
astreet   -0.13
Positive Logits
words      0.52
words      0.43
Words      0.34
.words     0.33
_words     0.32
Words      0.31
palabras   0.28
word       0.28
wards      0.27
wards      0.27
Activations Density 0.013%