Explanations
language model understanding
Negative Logits
uksi    0.89
Ссылки    0.79
(!(    0.78
ровать    0.78
Winkel    0.77
fény    0.77
joner    0.76
க்கை    0.75
करा    0.75
λαν    0.75
Positive Logits
spoken    1.03
Speaking    0.93
Bari    0.91
Wikipédia    0.89
Speaking    0.88
ローブ    0.88
pathologists    0.87
zinha    0.86
ట    0.84
Herd    0.83
Activations Density: 0.688%