INDEX
Explanations
references to introductory phrases or sentences
Negative Logits
token           logit
tas             -0.18
duct            -0.17
eries           -0.16
bon             -0.15
ordin           -0.15
ires            -0.15
/about          -0.15
sofar           -0.15
ographically    -0.14
IFF             -0.14
Positive Logits
token           logit
iciar           0.19
eview           0.17
gle             0.16
imary           0.16
antry           0.15
amins           0.14
ngr             0.14
IMIT            0.14
ício            0.14
inez            0.14
Activations Density 0.116%