INDEX
Explanations
articles and their variations, indicating a focus on nouns or noun phrases
Negative Logits
pth       -0.18
777       -0.17
321       -0.16
821       -0.15
804       -0.15
906       -0.15
dden      -0.15
Knot      -0.14
venir     -0.14
attles    -0.14
Positive Logits
isse       0.16
anda       0.15
ument      0.15
pra        0.15
UDO        0.15
ences      0.15
olle       0.15
interop    0.14
овари      0.14
.lp        0.14
Activations Density 0.068%
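The density figure can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading under stated assumptions: the page does not define the metric, so the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature is active. The source page does not state its exact definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 7 of which are active.
sample = [0.0] * 9993 + [0.5] * 7
density = activation_density(sample)
print(f"{density:.3%}")
```

A density well below 1% would mean the feature is sparse: it fires on only a small subset of tokens.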