INDEX
Explanations
phrases related to causing harm to oneself or others
references to death and killing
Negative Logits
soType          -0.83
eret            -0.82
taboola         -0.76
Wide            -0.76
soDeliveryDate  -0.73
å§«             -0.72
ĸļ              -0.70
tv              -0.69
URI             -0.68
worthiness      -0.67
Positive Logits
innocent    0.84
unborn      0.81
unarmed     0.81
intruder    0.80
terrorists  0.78
messenger   0.77
senseless   0.75
classmate   0.74
murderer    0.74
crap        0.73
Activations Density 0.161%