Explanations
references to physical harm or damage
Negative Logits
Token       Logit
lify        -0.17
GI          -0.17
enty        -0.15
shire       -0.15
midi        -0.15
?(:         -0.15
mie         -0.15
خاÙĨÙĩ       -0.15
Nİ          -0.14
anness      -0.14
Positive Logits
Token       Logit
done        0.29
害          0.22
Done        0.22
done        0.21
DONE        0.21
sustained   0.21
Done        0.20
-done       0.18
aceutical   0.18
(done       0.17
Activations Density 0.060%