Explanations
prefix-suffix word construction
Negative Logits
Token     Value
on        0.93
า         0.80
was       0.79
다        0.77
an        0.75
at        0.69
ap        0.64
كان       0.64
landen    0.62
ా         0.60
Positive Logits
Token     Value
2         0.75
3         0.75
7         0.66
9         0.66
0         0.64
8         0.62
1         0.61
6         0.58
4         0.57
den       0.54
Activations Density: 1.440%