Explanations
Attends from tokens indicating a lack or absence to tokens that imply completion or reaching a goal.
Head Attr Weights
0:0.15
1:0.11
2:0.10
3:0.11
4:0.09
5:0.04
6:0.18
7:0.17
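As a rough illustration of how the list above can be read, the sketch below ranks the eight heads by weight. It assumes each entry is `head index:weight` and that a larger weight means a larger share of the feature is attributed to that head; the page itself does not define the column, so that reading is an assumption. The name `rank_heads` is hypothetical.

```python
# Per-head attribution weights copied from the list above (head index -> weight).
head_weights = {0: 0.15, 1: 0.11, 2: 0.10, 3: 0.11, 4: 0.09, 5: 0.04, 6: 0.18, 7: 0.17}


def rank_heads(weights):
    """Return (head, weight) pairs sorted from largest to smallest weight."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_heads(head_weights)
total = sum(head_weights.values())

print(ranked[:3])  # the three heads with the largest weights
print(f"sum of displayed weights: {total:.2f}")
```

Under that reading, heads 6, 7 and 0 carry the largest weights and head 5 the smallest. The displayed values sum to 0.95 rather than 1, which may simply reflect rounding to two decimals.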
Negative Logits
viewDidLoad  -0.32
astore  -0.31
<eos>  -0.31
PRNewswire  -0.27
manni  -0.27
ticularly  -0.27
icrous  -0.25
↵  -0.25
raquo  -0.25
hoạch  -0.24
Positive Logits
myſelf  0.63
itſelf  0.53
himſelf  0.52
Monfieur  0.52
themſelves  0.50
reaſon  0.50
uſed  0.49
ſtate  0.48
pleaſure  0.47
Majefty  0.47
Activations Density 0.046%
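To give a sense of scale for the density figure, here is a minimal sketch that assumes "activation density" means the fraction of tokens on which the feature has a non-zero activation. The page does not state the definition, so both the definition and the helper name `activation_density` are assumptions, and the sample data is made up for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 46 active tokens out of 100,000.
sample = [1.0] * 46 + [0.0] * 99_954
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.046%"
```

Under this assumed definition, a density of 0.046% corresponds to roughly 46 active tokens per 100,000.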