Explanations
proper names followed by pronouns
here's a breakdown
Negative Logits
↵↵    0.84
3    0.79
تون    0.77
6    0.77
8    0.75
7    0.68
ból    0.66
쳐    0.66
4    0.65
٣    0.65
Positive Logits
and    1.06
in    0.99
be    0.93
ید    0.90
ish    0.89
I    0.88
an    0.85
ö    0.82
ale    0.77
ist    0.76
Activations Density 0.001%