Explanations
names of people and their descriptions
lists and punctuation
Negative Logits
Token    Value
u        0.77
א        0.73
,        0.70
다음     0.66
の       0.65
in       0.63
and      0.62
f        0.62
ने       0.61
в        0.61
Positive Logits
Token                Value
[token not shown]    1.02
(                    0.63
[token not shown]    0.57
carrito              0.53
(«                   0.52
grumpy               0.51
illä                 0.49
waged                0.48
(                    0.48
önet                 0.48
Activations Density 0.008%
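
For scale, 0.008% is about 8 token positions in every 100,000. The sketch below shows one way such a figure could be computed, assuming density means the share of sampled token positions on which the feature's activation is above zero; this page does not state the definition. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and for illustration only.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of token positions where the feature fires.

    Assumption: a position counts as "firing" when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Hypothetical sample: 100,000 token positions, 8 of them with a
# nonzero activation. These numbers are made up to match the scale
# of the reported density, not taken from real data.
sample = [0.0] * 99_992 + [1.3, 0.4, 2.1, 0.9, 0.2, 3.0, 0.7, 1.1]

density = activation_density(sample)
print(f"{density:.3f}%")  # prints 0.008%
```

A density this low would mean the feature is very sparse: it is silent on almost all tokens and fires only on a narrow set of contexts.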