INDEX
Explanations
words related to significant or impactful actions and their consequences
Head Attr Weights
0:0.02
1:0.01
2:0.07
3:0.05
4:0.14
5:0.03
6:0.06
7:0.37
8:0.03
9:0.04
10:0.06
11:0.06
Negative Logits
Simple  -1.48
CHO     -1.44
mand    -1.43
hing    -1.42
iae     -1.40
cho     -1.38
iger    -1.38
anza    -1.38
RH      -1.37
glers   -1.35
Positive Logits
boosters   1.68
medals     1.66
champagne  1.54
gobl       1.49
laure      1.47
bounty     1.44
Medals     1.44
Rivals     1.43
�          1.42
papers     1.42
Activations Density 0.001%