INDEX
Explanations
phrases indicating outcomes or consequences
Head Attr Weights
0:0.02
1:0.03
2:0.08
3:0.09
4:0.02
5:0.04
6:0.11
7:0.08
8:0.06
9:0.26
10:0.07
11:0.10
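
A minimal sketch of how the `head:weight` pairs above could be read programmatically, for example to find the head with the largest attribution. It assumes the twelve entries are per-head attribution weights indexed 0 to 11; the names `RAW` and `parse_head_weights` are hypothetical and not part of the dashboard.

```python
# Head attribution weights copied from the listing above as "head:weight" pairs.
# RAW and parse_head_weights are illustrative names, not a dashboard API.
RAW = "0:0.02 1:0.03 2:0.08 3:0.09 4:0.02 5:0.04 6:0.11 7:0.08 8:0.06 9:0.26 10:0.07 11:0.10"


def parse_head_weights(raw: str) -> dict[int, float]:
    """Parse whitespace-separated "head:weight" pairs into {head: weight}."""
    weights: dict[int, float] = {}
    for pair in raw.split():
        head, weight = pair.split(":")
        weights[int(head)] = float(weight)
    return weights


weights = parse_head_weights(RAW)
top_head = max(weights, key=weights.get)  # head with the largest weight
total = sum(weights.values())             # listed weights need not sum to 1
print(f"top head: {top_head} ({weights[top_head]:.2f}), total: {total:.2f}")
```

On the values listed here, head 9 carries the largest weight (0.26), and the twelve listed weights sum to 0.96.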
Negative Logits
Reviewed        -1.14
Attempt         -1.13
itement         -1.13
undo            -1.03
vious           -1.02
�               -1.02
conn            -0.97
upload          -0.96
ourselves       -0.96
Administration  -0.96
Positive Logits
hey             1.04
olor            1.03
peril           1.00
haunt           0.99
stride          0.97
setback         0.96
corro           0.93
spr             0.92
los             0.91
gewater         0.91
Activations Density: 0.041%