INDEX
Explanations
references to different types of violence and harassment
topics related to violence and harassment
Negative Logits
Token       Logit
mega        -0.72
Slim        -0.66
pop         -0.64
izons       -0.63
outlook     -0.63
yssey       -0.62
Blueprint   -0.62
hedral      -0.62
Bright      -0.61
Zen         -0.61
Positive Logits
Token           Logit
perpetrated     1.30
inflicted       1.25
oneself         1.04
punishable      1.04
offences        1.03
prohibited      1.02
unintentional   1.00
uttered         0.99
inflic          0.99
manslaughter    0.98
Activations Density: 0.338%