INDEX
Explanations
intentionally bad reward model
Negative Logits
верну  0.48
තු  0.48
결  0.48
পারিল  0.45
ಮತ್ತೆ  0.45
یک  0.45
পুনরায়  0.44
بین  0.44
лизова  0.44
ವಹ  0.44
Positive Logits
uk  0.48
дзен  0.44
以外  0.42
present  0.42
Lack  0.42
prefer  0.41
mazing  0.41
缺乏  0.40
ac  0.40
Apart  0.40
Activations Density: 0.001%