Explanations
prohibited actions and refusal
Negative Logits
يمكنك	0.76
میتوانید	0.70
reali	0.70
doğrud	0.70
réellement	0.69
होतात	0.69
शकतात	0.69
应当	0.67
prawd	0.66
\"]\	0.66
Positive Logits
exception	1.07
はその	1.01
example	0.91
achieve	0.91
exceptions	0.91
achieving	0.90
exemplifies	0.90
例外	0.90
falling	0.88
achieves	0.87
Activations Density 0.090%