INDEX
Explanations
references to heads and their positions or states
Negative Logits
atz     -0.16
lesi    -0.15
Hop     -0.15
destin  -0.14
aji     -0.14
ainer   -0.14
ways    -0.14
jet     -0.14
osta    -0.14
имо     -0.14
Positive Logits
wag     0.15
][(     0.15
andon   0.15
ichick  0.14
?page   0.13
ighbor  0.13
ycz     0.13
ibold   0.13
@(      0.13
çĬ      0.13
Activations Density: 0.073%