INDEX
Explanations
"the" followed by specific nouns
"the" followed by a noun
Negative Logits
The    0.95
The    0.89
an    0.82
↵    0.76
et    0.74
f    0.70
he    0.67
nThe    0.67
b    0.67
k    0.65
Positive Logits
at    0.59
is    0.59
ت    0.58
О    0.58
َل    0.56
贺    0.56
र    0.56
paquete    0.56
ق    0.55
లేదు    0.55
Activations Density 0.301%