INDEX
Explanations
refusal
I apologize for not fulfilling requests
cannot comply
Negative Logits
apologies	1.05
apologised	1.03
apologized	1.02
apology	1.02
apologize	1.01
apologizing	1.01
apologise	1.00
apolog	0.98
sorry	0.94
Sorry	0.94
Positive Logits
satisfying	0.44
Satisf	0.43
satisfactory	0.42
만족	0.41
unsatisf	0.39
満足	0.38
satisfy	0.37
惊喜	0.36
estran	0.36
necessariamente	0.35
Activations Density 0.017%