Explanations
was followed by past states
Negative Logits
it      1.21
ir      1.06
ing     1.05
ش       1.04
ant     1.02
il      0.96
ے       0.96
?       0.95
ка      0.94
ly      0.93
Positive Logits
ي       1.35
кі      1.01
י       0.98
。      0.95
was     0.94
is      0.85
        0.84
0       0.83
。)     0.82
。</    0.79
Activations Density: 0.164%