Explanations
Introductory phrases and conditional statements
Starts phrases with articles or pronouns
It detects the start of the model/assistant's output (the beginning-of-sequence or initial tokens of a generated reply).
Negative Logits
⪜             -0.80
snippetHide   -0.80
ſicht         -0.79
zwiſchen      -0.79
Dieſe         -0.78
<unused52>    -0.77
<unused68>    -0.77
<unused23>    -0.77
<unused17>    -0.77
<unused8>     -0.77
Positive Logits
The       0.41
dizem     0.37
You       0.37
A         0.32
At        0.31
It        0.29
Usually   0.29
Your      0.29
is        0.28
Just      0.28
Activations Density 0.001%