INDEX
Explanations
harmful and inappropriate content
Negative Logits
చు  0.68
murdered  0.68
slain  0.65
killings  0.65
मरने  0.64
stabbed  0.64
murders  0.62
humble  0.62
précéd  0.61
boredom  0.61
Positive Logits
inappropriate  2.55
unethical  2.31
unsuitable  2.07
unsustainable  2.01
improper  2.00
harmful  2.00
inappropri  1.98
unhealthy  1.97
irresponsible  1.93
inappropriately  1.92
Activations Density 2.043%