INDEX
Explanations
instances of being targeted or labeled as targets in various contexts
Head Attr Weights
0:0.02
1:0.01
2:0.12
3:0.05
4:0.21
5:0.03
6:0.04
7:0.28
8:0.04
9:0.03
10:0.05
11:0.06
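The weights above are easier to read once ranked. The following is a minimal sketch, not part of the original dashboard, that copies the twelve listed values into a dictionary and reports the heads with the largest weights. The names `head_attr_weights` and `top_heads` are hypothetical and chosen only for illustration. The listed values sum to about 0.94 rather than 1, presumably because of rounding in the display.

```python
# Head attribution weights copied from the listing above (head index -> weight).
# The variable and function names are illustrative assumptions, not dashboard API.
head_attr_weights = {
    0: 0.02, 1: 0.01, 2: 0.12, 3: 0.05, 4: 0.21, 5: 0.03,
    6: 0.04, 7: 0.28, 8: 0.04, 9: 0.03, 10: 0.05, 11: 0.06,
}


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weight, descending."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


top = top_heads(head_attr_weights)
total = sum(head_attr_weights.values())

# Share of the listed weight carried by the top-k heads.
top_share = sum(w for _, w in top) / total

print(top)
print(round(total, 2), round(top_share, 2))
```

With the values as listed, heads 7, 4, and 2 rank first, and together they carry roughly two thirds of the listed weight.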
Negative Logits
heit    -1.74
ACTED   -1.50
VOL     -1.48
izont   -1.41
ctrl    -1.40
�       -1.38
ROM     -1.38
hop     -1.36
verbs   -1.36
redo    -1.34
Positive Logits
trespass      1.57
ridicule      1.57
intrusion     1.41
tresp         1.37
sidel         1.37
destro        1.35
skelet        1.35
Barron        1.35
unreasonable  1.35
pilgr         1.34
Activations Density 0.003%