INDEX
Explanations
"is" / "are" followed by an adjective or placeholder
Negative Logits
ش	1.02
?	1.00
在	0.98
*	0.96
ل	0.95
ти	0.93
도	0.86
ول	0.85
أ	0.85
خ	0.82
Positive Logits
are	1.14
d	1.13
is	0.99
dır	0.92
t	0.91
larını	0.89
した	0.86
has	0.82
larının	0.80
ła	0.77
Activations Density 0.705%