INDEX
Explanations
phrases indicating consequences or results related to events or actions
Head Attr Weights
0:0.02
1:0.03
2:0.12
3:0.28
4:0.02
5:0.03
6:0.06
7:0.12
8:0.08
9:0.05
10:0.06
11:0.08
Negative Logits
Downloadha  -1.35
fri         -1.12
activated   -1.07
bats        -1.07
utical      -1.05
Jar         -1.03
IFE         -1.00
pired       -0.99
stalk       -0.98
TY          -0.97
Positive Logits
downstream  1.25
subtract    1.16
ⓘ           1.13
Fowler      1.09
abwe        1.03
estival     1.02
itol        1.02
otos        1.01
Chinatown   0.99
corrid      0.98
Activations Density 0.005%