Explanations
references to violent actions and military conflicts
Negative Logits
ArgsConstructor    -0.64
ologues            -0.55
FontWeight         -0.54
יצוני              -0.54
Fprintf            -0.53
strerror           -0.52
kasarigan          -0.50
alapján            -0.47
ää                 -0.47
tác                -0.47
Positive Logits
unsuspecting       1.41
innocent           1.36
defen              1.30
innocent           1.12
inocente           1.07
helpless           1.04
innoc              0.99
unarmed            0.98
targets            0.97
hapless            0.93
Activations Density: 0.502%