INDEX
Explanations
phrases related to violence or threats of harm
references to violence or death
Negative Logits
Token           Logit
angular         -0.79
herty           -0.78
cancell         -0.77
RGB             -0.76
ARB             -0.75
taboola         -0.72
Applic          -0.71
ional           -0.70
soDeliveryDate  -0.70
Management      -0.68
Positive Logits
Token           Logit
senseless       1.01
aven            0.96
vengeance       0.95
revenge         0.94
innocent        0.92
martyr          0.87
retribution     0.84
innoc           0.83
ransom          0.81
coward          0.80
Activations Density: 0.588%