Explanations
refusing harmful or unethical requests
Negative Logits
Token  Value
should  0.46
应该  0.45
hopefully  0.45
dovrebbe  0.45
conviene  0.44
باید  0.43
তবে  0.42
應該  0.42
può  0.42
manchmal  0.42
Positive Logits
Token  Value
request  0.80
언급  0.75
request  0.72
requesting  0.72
описание  0.72
descriptions  0.70
description  0.69
richiesta  0.68
запрос  0.68
requested  0.67
Activations Density: 0.063%