Explanations
phrases indicating threats or potential violence
Negative Logits
Token        Value
ÑĤÑĢи       -0.15
iped         -0.15
653          -0.15
umbs         -0.14
atos         -0.14
Grove        -0.14
reusable     -0.14
undos        -0.14
591          -0.13
offenses     -0.13
Positive Logits
Token        Value
electro       0.23
lyn           0.20
staple        0.18
perman        0.17
sued          0.17
pitch         0.17
permanently   0.17
punch         0.16
Electro       0.16
sue           0.16
Activations Density: 0.221%