Explanations
phrases that include confirmations, warnings, or prompts indicating user actions or limits
Negative Logits
↵↵          -0.75
<eos>       -0.71
–           -0.61
–           -0.52
—           -0.49
WHEN        -0.47
↵↵↵↵        -0.46
EnableWeb   -0.43
Ibid        -0.43
↵↵↵↵↵       -0.42
Positive Logits
\'          1.25
!");        1.24
!\          1.19
.");        1.14
:");        1.13
!');        1.13
...");      1.09
!";         1.09
!')         1.07
.');        1.06
Activations Density 0.446%