Explanations
words related to attracting attention, support, or resources
Negative Logits
Token            Logit
äºĭ              -0.15
inous            -0.15
Golden           -0.14
jal              -0.14
éĢļ              -0.14
uninitialized    -0.14
plode            -0.13
TY               -0.13
oes              -0.13
omba             -0.13
Positive Logits
Token            Logit
attention        0.33
attention        0.32
внимание         0.27
Attention        0.24
Attention        0.24
attn             0.22
_attention       0.21
attracted        0.21
ively            0.21
attracts         0.20
Activations Density: 0.041%