INDEX
Explanations
phrases related to harmful acts or actions
words related to acts of violence and their consequences
Negative Logits
arest      -0.72
Plat       -0.70
aver       -0.67
oult       -0.67
pole       -0.66
iHUD       -0.66
hack       -0.65
therapy    -0.64
oret       -0.63
arro       -0.61
Positive Logits
perpetrated     1.07
committing      0.92
withd           0.91
interstitial    0.90
committed       0.86
ahime           0.84
impunity        0.82
ãĥ¼ãĥĨ          0.80
heinous         0.80
20439           0.78
Activations Density 0.007%
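
As a rough illustration of the density figure above, the sketch below assumes that "activations density" means the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration, not something stated in this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumed definition of density; the dashboard does not spell it out.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 7 of 100,000 token positions.
sample = [0.0] * 99_993 + [1.5] * 7
density = activation_density(sample)
print(f"{density:.3%}")  # 0.007%
```

Under this reading, a density of 0.007% corresponds to the feature firing on about 7 tokens per 100,000.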