INDEX
Explanations
references to violence and aggression in various contexts
Negative Logits
Token            Logit
Fprintf          -0.58
ArgsConstructor  -0.57
יצוני            -0.50
strerror         -0.47
clic             -0.44
řád              -0.43
ylus             -0.43
OfDay            -0.43
緒               -0.42
<_>              -0.42
Positive Logits
Token            Logit
unsuspecting     1.34
innocent         1.18
defen            1.09
innocent         1.00
targets          0.98
inocente         0.95
vulnerable       0.94
helpless         0.91
unprotected      0.91
innoc            0.89
Activations Density 0.593%