INDEX
Explanations
instances of authorization or approval within a context of accountability or ethical dilemmas
Head Attr Weights
0:0.02
1:0.01
2:0.14
3:0.24
4:0.13
5:0.02
6:0.06
7:0.05
8:0.07
9:0.06
10:0.09
11:0.04
Negative Logits
arrang        -1.71
comprom       -1.70
respectively  -1.70
mosqu         -1.67
Canaver       -1.58
millenn       -1.56
satell        -1.55
��            -1.50
franch        -1.46
Adin          -1.46
Positive Logits
[/        2.08
[+        1.89
fiction   1.79
park      1.73
ipedia    1.73
Trivia    1.70
:=        1.70
·         1.67
@         1.64
language  1.64
Activations Density 0.002%