Explanations
starting to or improving performance
Negative Logits
ancies    0.53
as        0.51
ata       0.47
ates      0.47
otas      0.46
atino     0.44
igail     0.44
ikli      0.44
linge     0.43
cik       0.43
Positive Logits
hairy     0.50
τι        0.49
כ         0.48
וב        0.46
נ         0.46
תי        0.45
툐         0.44
ن         0.43
הצ        0.43
करणे      0.43
Activations Density 0.000%