INDEX
Explanations
occurrences of the word "the."
Negative Logits
ollen         -0.17
trao          -0.16
.synthetic    -0.15
anza          -0.14
olen          -0.14
ative         -0.14
newsp         -0.14
andra         -0.14
acam          -0.14
erta          -0.14
Positive Logits
uce           0.14
лор           0.14
venes         0.14
ерап          0.14
лих           0.14
éĵ            0.14
bable         0.13
caff          0.13
vice          0.13
ナー          0.13
Activations Density 0.161%
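
The density figure can be read as the share of token positions on which the feature fires. The sketch below is one way to compute such a number; it assumes density means "fraction of activations strictly above zero", which is an assumption and not necessarily the exact definition this dashboard uses. The function name, the threshold parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly greater than threshold.

    Assumption: a feature counts as "active" on a token when its
    activation exceeds the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 16 of which activate the feature.
sample = [0.0] * 9984 + [0.7] * 16
density = activation_density(sample)
print(f"Activations Density {density:.3%}")
```

A low value like the one reported here means the feature is sparse: it is silent on the overwhelming majority of tokens.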