Explanations
pronouns followed by actions
Negative Logits
೫	0.60
ਡ	0.59
on	0.57
트는	0.57
त	0.56
менте	0.56
מ	0.56
∈	0.54
五	0.54
ಾನೆ	0.54
Positive Logits
P	0.61
idő	0.59
o	0.59
V	0.56
blive	0.55
alcanz	0.53
و	0.53
Khi	0.52
julho	0.52
każdy	0.52
Activations Density: 0.087%