INDEX
Explanations
instances of decision-making processes and their implications
Head Attr Weights
0:0.01
1:0.01
2:0.04
3:0.04
4:0.05
5:0.03
6:0.38
7:0.15
8:0.04
9:0.04
10:0.09
11:0.06
Negative Logits
ierre    -1.23
retty    -1.20
info     -1.18
othy     -1.18
MIN      -1.16
amon     -1.15
nell     -1.14
reon     -1.13
DN       -1.12
encia    -1.09
Positive Logits
EStream        1.33
rall           1.29
propos         1.27
favour         1.26
adaptations    1.25
decisions      1.25
favor          1.23
conformity     1.23
aroo           1.22
subclass       1.21
Activations Density 0.004%