Explanations
model refusal for unsafe content
NEGATIVE LOGITS
Token       Value
appareil    0.38
iatric      0.38
خیص         0.38
ابنائي      0.37
мои         0.37
benim       0.37
моих        0.37
meiner      0.36
dennoch     0.36
اعه         0.36
POSITIVE LOGITS
Token       Value
sorry       0.60
Sorry       0.54
sorry       0.50
Sorry       0.49
Click       0.46
Disclaimer  0.45
Title       0.43
Content     0.40
Featuring   0.40
Please      0.38
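
Token lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and keeping the lowest- and highest-scoring tokens. Whether this page was generated that way is an assumption; the sketch below only illustrates the general idea on toy data. The function name `top_logit_tokens`, the toy vocabulary, and the random matrices are all hypothetical and are not taken from the page.

```python
import numpy as np


def top_logit_tokens(direction, unembed, vocab, k=10):
    """Return the k lowest- and k highest-scoring (token, score) pairs.

    direction: (d_model,) feature output direction (assumed input)
    unembed:   (d_model, vocab_size) unembedding matrix (assumed input)
    vocab:     list of vocab_size token strings
    """
    # One score per vocabulary token: how strongly the direction pushes its logit.
    scores = direction @ unembed
    order = np.argsort(scores)  # ascending: most negative first
    negative = [(vocab[i], float(scores[i])) for i in order[:k]]
    positive = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy data only: random numbers, not values from a real model.
rng = np.random.default_rng(0)
vocab = ["sorry", "Sorry", "Please", "Click", "the", "and", "meiner", "dennoch"]
unembed = rng.normal(size=(4, len(vocab)))
direction = rng.normal(size=4)

negative, positive = top_logit_tokens(direction, unembed, vocab, k=3)
for token, score in positive:
    print(f"{token:<12}{score:.2f}")
```

With a real model, `direction` and `unembed` would come from the trained weights, and `k=10` would match the ten rows shown in each list.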
Activation density: 0.010%