INDEX
Explanations
phrases emphasizing concentration or direct attention
Negative Logits
ishly     -0.20
ish       -0.19
orce      -0.17
aly       -0.17
iggers    -0.16
ity       -0.16
hiba      -0.15
eu        -0.15
utan      -0.15
een       -0.15
Positive Logits
attention  0.23
areas      0.22
SED        0.21
area       0.20
point      0.20
Areas      0.19
(es        0.18
point      0.18
-area      0.18
shifted    0.18
Activations Density 0.041%