INDEX
Explanations
Generating ethical refusals
Negative Logits
낸  0.44
所以  0.43
यांची  0.42
oluyor  0.42
ува  0.41
所以  0.40
ુંદર  0.40
biết  0.38
之为  0.38
বিজ  0.38
Positive Logits
hoped  0.59
supposed  0.55
Theoretically  0.54
Espero  0.51
purported  0.51
controvers  0.49
teoria  0.49
Attempts  0.49
attempt  0.49
alleged  0.48
Activations Density: 0.095%
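
As a rough illustration of how a density figure like this could be computed, the sketch below assumes that "activations density" means the percentage of sampled tokens on which the feature's activation exceeds a threshold. That definition, the function name activation_density, and the sample values are assumptions for illustration; the page does not state how the number was produced.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: density = (active tokens / total tokens) * 100.
    """
    if len(activations) == 0:
        raise ValueError("need at least one token activation")
    # Count tokens on which the feature fires above the threshold.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 1 of 10 tokens.
sample = [0.0] * 9 + [1.2]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.095% would correspond to the feature firing on roughly 1 in every 1,050 sampled tokens.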