Explanations
principle followed by explanation
Negative Logits
Token  Value
are  0.98
ру  0.84
ियों  0.77
пи  0.77
て  0.76
ми  0.75
is  0.73
ре  0.73
но  0.71
ての  0.70
Positive Logits
Token  Value
us  0.86
ad  0.80
↵  0.76
น้อง  0.75
il  0.74
ir  0.74
H  0.71
Α  0.71
HAS  0.70
IS  0.70
Activations Density: 0.021%