INDEX
Explanations
instances of the word "The."
Negative Logits
at      -0.18
an      -0.18
anks    -0.18
est     -0.17
ease    -0.17
trag    -0.16
atz     -0.16
eyse    -0.15
rana    -0.15
anzi    -0.15
Positive Logits
eron        0.23
irteen      0.21
emed        0.21
ermo        0.21
orton       0.20
acker       0.19
ales        0.19
istle       0.19
iel         0.19
aimassage   0.19
Activations Density 0.015%