INDEX
Explanations
programmed to refuse harmful requests
Negative Logits
Wend        0.47
нен         0.40
мет         0.40
Vend        0.38
wend        0.37
Adi         0.37
السالب      0.36
SK          0.36
Ժ           0.36
skip        0.35
Positive Logits
cubes       0.38
শনে         0.37
Saha        0.37
સુ          0.36
STAR        0.36
pug         0.35
Keenan      0.35
PANEL       0.35
evaporated  0.35
Kiv         0.35
Activations Density: 0.004%