INDEX
Explanations
references to harm or risk associated with actions or situations
Negative Logits
viață        -0.48
cementerio   -0.48
cesse        -0.48
JMenuBar     -0.46
ẢN           -0.46
Varian       -0.45
gelassen     -0.45
,:);         -0.45
깥           -0.44
رسال         -0.43
Positive Logits
harmed         1.03
harming        0.97
harm           0.93
harms          0.92
harmed         0.88
hurting        0.87
脚注の使い方   0.87
Harm           0.85
harm           0.85
hurt           0.82
Activations Density 0.347%
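
The density figure above is not defined in this listing. A minimal sketch, assuming it means the fraction of token positions on which the feature's activation is above zero, might look like the following. The function name `activation_density`, the `threshold` parameter, and the toy activation values are hypothetical and chosen only for illustration; they do not come from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of token positions on which the
    feature fires at all. The source does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 1 of 8 token positions.
toy = [0.0, 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0]
density = activation_density(toy)
print(f"{density:.3%}")  # prints "12.500%"
```

Under this reading, a density of 0.347% would correspond to the feature firing on roughly 3 or 4 of every 1,000 token positions sampled.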