Explanations
punctuation and markers indicative of citations or references within academic texts
Negative Logits
Token        Logit
ork          -0.16
warts        -0.16
favor        -0.15
tr           -0.15
trav         -0.14
kinds        -0.14
wen          -0.14
lotte        -0.14
Î            -0.14
en           -0.13
Positive Logits
Token        Logit
maal         0.15
ardy         0.14
lparr        0.14
á»§i         0.14
eterminate   0.14
gul          0.14
chatt        0.13
ÅĻ           0.13
eydi         0.13
Hüs          0.13
Activations Density 0.011%