Explanations
phrases and concepts related to threats and violence
Negative Logits
Token         Logit
iped          -0.15
prefix        -0.15
135           -0.15
azor          -0.15
295           -0.14
lis           -0.14
/categories   -0.14
Headquarters  -0.14
esper         -0.14
602           -0.14
Positive Logits
Token         Logit
kill          0.35
Kill          0.28
murder        0.27
kill          0.26
Kill          0.24
.kill         0.24
kid           0.24
commit        0.24
_kill         0.23
kills         0.23
Activations Density 0.291%