INDEX
Explanations
dialogue turns or conversational openings
Negative Logits
Token    Logit
[](      -0.09
lav      -0.09
Cfg      -0.09
shl      -0.09
shit     -0.09
åijĢ      -0.09
eus      -0.09
åĻ       -0.08
.aws     -0.08
impl     -0.08
Positive Logits
Token    Logit
fine     0.13
fine     0.12
Fine     0.12
Fine     0.11
FINE     0.10
You      0.10
bunch    0.10
daring   0.09
"You     0.09
Hey      0.09
Activations Density 0.067%