INDEX
Explanations
lists numbers after punctuation
Negative Logits
is  0.56
п  0.55
But  0.55
стю  0.50
ре  0.50
נת  0.49
пу  0.48
ار  0.48
However  0.48
等  0.47
Positive Logits
VILLE  0.63
<unused2156>  0.63
<unused2152>  0.63
يج  0.63
<unused193>  0.63
JOHN  0.62
showcased  0.61
<unused2033>  0.61
بھی  0.61
unveiled  0.61
Activations Density 1.307%