INDEX
Explanations
phrases that express negation or contradiction
Head Attr Weights
0:0.01
1:0.01
2:0.08
3:0.34
4:0.02
5:0.02
6:0.06
7:0.11
8:0.04
9:0.06
10:0.05
11:0.15
Negative Logits
Pengu     -1.30
SY        -1.27
Boat      -1.18
EStream   -1.16
Gi        -1.16
�士       -1.13
��        -1.13
Shades    -1.12
Ammo      -1.12
Lifetime  -1.10
Positive Logits
than       1.64
bends      1.54
iffe       1.49
forth      1.29
reaching   1.28
iating     1.27
mentioned  1.26
ences      1.25
aunts      1.24
iations    1.23
Activations Density 0.017%