INDEX
Explanations
acknowledging limitations or inability
role markers that denote the start of the assistant/model’s turn in structured chat transcripts.
New Auto-Interp
Negative Logits
ście
0.40
Salt
0.39
ㄤ
0.37
ätte
0.37
驿
0.36
[&
0.36
Happens
0.35
0.35
muş
0.35
施
0.34
POSITIVE LOGITS
regrett
0.50
regret
0.45
unfortunately
0.45
sorry
0.43
عدم
0.43
regretted
0.42
unable
0.41
inability
0.41
unavailable
0.41
scandalous
0.40
Activations Density 0.031%