Explanations
words meaning never or no
the assistant's safety-focused disclaimers and strong refusal/ethical-warning statements.
Negative Logits
較  0.52
보다는  0.47
较  0.46
কিছুটা  0.46
somewhat  0.45
повече  0.45
বেশি  0.45
יותר  0.45
оптими  0.44
лишком  0.43
Positive Logits
niemals  0.67
jamás  0.62
ningún  0.62
assolutamente  0.61
Nunca  0.61
never  0.61
NEVER  0.60
hiçbir  0.59
Nunca  0.59
ninguna  0.58
Activations Density: 0.449%