Index
Explanations
references to the concepts of protection and prevention related to people and their actions
Head Attr Weights
0:0.02
1:0.02
2:0.05
3:0.13
4:0.37
5:0.04
6:0.04
7:0.12
8:0.03
9:0.03
10:0.06
11:0.05
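
The head attribution weights above can be read as a per-head share of the feature's attribution. The snippet below is a minimal sketch of ranking them to see which heads dominate. The values are copied from the listing; the names `head_weights` and `rank_heads` are illustrative and not part of any tool's API.

```python
# Head attribution weights copied from the listing (head index -> weight).
head_weights = {
    0: 0.02, 1: 0.02, 2: 0.05, 3: 0.13,
    4: 0.37, 5: 0.04, 6: 0.04, 7: 0.12,
    8: 0.03, 9: 0.03, 10: 0.06, 11: 0.05,
}


def rank_heads(weights):
    """Return (head, weight) pairs sorted from largest to smallest weight."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_heads(head_weights)
top_head, top_weight = ranked[0]
total = sum(head_weights.values())

for head, weight in ranked[:3]:
    print(f"head {head}: {weight:.2f}")
print(f"sum of listed weights: {total:.2f}")
```

Under this reading, head 4 carries the largest share (0.37), followed by heads 3 and 7. The listed weights sum to 0.96 rather than 1.00, presumably because each value is rounded to two decimals.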
Negative Logits
Rohing
-1.82
Airl
-1.51
unden
-1.47
Wheels
-1.45
advoc
-1.41
lil
-1.41
Ashton
-1.40
forums
-1.37
:{
-1.37
�
-1.35
Positive Logits
harm
1.73
undue
1.70
adversely
1.62
harmful
1.62
avoidance
1.61
future
1.60
ividual
1.60
altogether
1.48
discriminating
1.47
falsely
1.47
Activations Density 0.067%