Explanations
"sh" followed by common word endings
Negative Logits
an        2.83
ли        2.61
er        2.30
theless   2.25
ofthe     2.23
ש         2.22
н         2.17
り        2.13
서        2.06
주        1.95
Positive Logits
ের        2.11
OREM      1.96
িম        1.84
uggling   1.84
rappel    1.74
peers     1.72
НА        1.71
EEP       1.69
reps      1.67
िन        1.65
Activations Density 0.147%