Explanations
unwanted or explicit requests
Negative Logits
Token           Value
(blank)         0.96
!               0.84
+               0.75
(               0.74
better          0.72
or              0.72
:               0.71
+               0.71
(blank)         0.70
(               0.69
Positive Logits
Token           Value
purporting      1.20
misog           0.91
unwarranted     0.88
aksud           0.88
сексуа          0.86
disrespectful   0.85
purportedly     0.85
alleging        0.84
політи          0.84
indiscrimin     0.84
Activations Density: 0.014%