Explanations
phrases that indicate authorship or the means by which something is accomplished
Negative Logits
Token       Logit
ahat        -0.14
rello       -0.14
.sz         -0.14
avenport    -0.13
ccak        -0.13
icari       -0.13
laví        -0.13
voy         -0.13
AFX         -0.13
IDX         -0.13
Positive Logits
Token       Logit
being       0.22
virtue      0.18
means       0.17
ung         0.17
having      0.16
olan        0.16
mand        0.15
being       0.15
apy         0.15
not         0.15
Activations Density 0.070%