INDEX
Explanations
refusal of harmful requests
Negative Logits
यत  0.79
الو  0.76
linha  0.75
uie  0.75
কর্ম  0.73
iami  0.73
oyu  0.72
ఉంటాయి  0.71
trattano  0.69
ília  0.69
Positive Logits
masculinity  0.65
வடக்கு  0.64
zero  0.63
ようやく  0.62
underweight  0.61
east  0.61
audacity  0.61
nada  0.61
гле  0.60
tasteless  0.60
Activations Density: 0.068%