Explanations
violates my safety guidelines
Negative Logits
alahkan       0.40
iseite        0.38
(!$           0.38
عليكم         0.38
unlike        0.37
erçe          0.37
vigil         0.37
correctement  0.36
juridique     0.36
regulatory    0.36
Positive Logits
several       0.77
plusieurs     0.68
Several       0.68
several       0.65
عدة           0.61
varios        0.59
flera         0.57
Several       0.57
varias        0.56
कई            0.56
Activations Density 0.011%