INDEX
Explanations
describes actions or states
Negative Logits
’    1.45
U    1.16
F    1.12
AL    1.09
O    1.06
}    1.00
},    0.98
У    0.96
ER    0.95
}$    0.93
Positive Logits
س    1.16
ти    1.07
िया    1.05
स    0.98
は何    0.97
товые    0.94
ឱ្យ    0.93
те    0.93
सिया    0.93
ルス    0.91
Activations Density 0.181%