Explanations
instances of denial or refusal related to various topics
Negative Logits
/INFO      -0.15
itz        -0.14
RTL        -0.14
ientos     -0.14
umlu       -0.14
Spectrum   -0.14
nelle      -0.14
SCR        -0.13
leh        -0.13
ie         -0.13
Positive Logits
uga        0.18
egal       0.17
ecure      0.17
arat       0.15
issance    0.15
igma       0.15
gettext    0.14
ệu         0.14
oux        0.14
arges      0.14
Activations Density 0.063%