Explanations
instances of the word "the"
Negative Logits
egend   -0.17
alth    -0.15
azon    -0.15
мот     -0.14
erdale  -0.14
Mans    -0.13
eah     -0.13
itor    -0.13
zon     -0.13
essel   -0.13
Positive Logits
pez     0.16
ght     0.16
obt     0.16
verts   0.15
pires   0.15
oints   0.15
we      0.15
出了    0.15
UGHT    0.14
aved    0.14
Activations Density 0.128%