Explanations
references to violence and threats to safety
Attacking, targeting, or harming others
violence against innocents
Negative Logits
Token            Logit
ArgsConstructor  -0.62
Fprintf          -0.54
יצוני            -0.53
strerror         -0.48
igény            -0.46
toHave           -0.46
tác              -0.45
FontWeight       -0.45
számára          -0.44
gestos           -0.44

Positive Logits
Token            Logit
unsuspecting     1.35
innocent         1.20
defen            1.07
innocent         1.00
inocente         0.94
unprotected      0.91
indiscrimin      0.90
helpless         0.90
innoc            0.88
hapless          0.87

Activations Density: 0.488%