INDEX
Explanations
references to attention or focus in various contexts
Negative Logits
rices       -0.18
inton       -0.17
hack        -0.17
hma         -0.16
itude       -0.16
acker       -0.16
imuth       -0.16
pora        -0.15
Injector    -0.15
hood        -0.15
Positive Logits
ested       0.27
estation    0.24
uned        0.23
ests        0.23
esting      0.23
itud        0.21
en          0.21
aining      0.21
a           0.20
t           0.19
Activations Density 0.006%