INDEX
Explanations
predicting the word that follows "positive"
Negative Logits
Token    Value
在    1.14
አ    1.00
も    0.99
In    0.98
và    0.98
ን    0.98
し    0.97
ING    0.96
Además    0.96
ל    0.95
Positive Logits
Token    Value
ма    1.15
지    1.15
<0x80>    1.06
тна    1.05
i    1.02
il    1.02
ิ    1.01
positive    0.97
is    0.94
ang    0.94
Activations Density: 0.035%