INDEX
Explanations
now, then contrasting reality
Negative Logits
ون    0.75
ро    0.70
ان    0.67
و    0.65
의    0.64
на    0.63
り    0.63
로    0.63
ம்    0.61
u    0.59
Positive Logits
are    0.81
of    0.80
t    0.74
on    0.74
at    0.66
[token missing]    0.64
ت    0.64
com    0.62
was    0.62
is    0.60
Activations Density 0.002%