Explanations
requests for inappropriate or harmful content.
Negative Logits
загру        0.43
loads        0.42
overloaded   0.42
riches       0.42
impatient    0.41
Nodes        0.41
отлич        0.40
!            0.40
overloading  0.40
流           0.40
Positive Logits
harmless     0.84
legitt       0.79
legitimate   0.73
legít        0.73
permissible  0.71
lawful       0.71
あくまで     0.68
lawfully     0.67
innocuous    0.67
respectful   0.64
Activations Density: 1.771%