INDEX
Explanations
refusing harmful story continuations
Negative Logits
ót  0.48
🤟  0.48
आईपी  0.43
ಗ  0.43
ماں  0.42
గే  0.41
ഐ  0.41
bearbeitet  0.41
ೀರಿ  0.41
phthalm  0.40
Positive Logits
https  0.49
astikan  0.46
ה  0.46
completes  0.45
Kết  0.44
spinner  0.43
Selamat  0.41
kết  0.41
j  0.41
concludes  0.40
Activations Density 0.001%