Explanations
AI limitations and refusals
Negative Logits
乐观  0.79
nice  0.79
ranking  0.75
зробити  0.75
guesses  0.74
മികച്ച  0.74
मजा  0.73
अच्छा  0.71
easy  0.70
чать  0.70
Positive Logits
Again  1.01
পরিবর্তিত  0.96
如果您  0.96
novamente  0.94
Again  0.92
refusal  0.89
Notwithstanding  0.89
erneut  0.89
reaff  0.88
Despite  0.88
Activations Density: 0.294%