INDEX
Explanations
terms related to attacks on Western and US interests
references to terrorist activities and attacks against the United States or its interests
Negative Logits
Token     Logit
oult      -0.75
lopp      -0.66
ammy      -0.66
adjust    -0.65
FU        -0.64
rosc      -0.64
ISTORY    -0.63
cend      -0.63
apter     -0.62
join      -0.62
Positive Logits
Token           Logit
targets         1.02
unarmed         0.89
innocent        0.85
civilians       0.81
helpless        0.79
innoc           0.77
unsuspecting    0.77
target          0.76
unprotected     0.75
suspected       0.75
Activations Density 0.745%
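
The density figure above is not defined on this page. A common reading, assumed here, is the fraction of sampled token positions on which the feature's activation is above zero. The sketch below illustrates that reading only: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the dashboard's own tooling.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The dashboard may define it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 20,000 token positions, 149 of them active.
# These numbers are invented so the result lines up with the 0.745% figure.
sample = [0.0] * 19851 + [2.1] * 149
density = activation_density(sample)
print(f"Activation density: {density:.3%}")
```

Under this reading, a density of 0.745% means the feature is silent on more than 99% of tokens, which fits a feature tied to a narrow topic.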