INDEX
Explanations
the phrase "what's going on."
Negative Logits
Token        Logit
ritical      -0.78
eah          -0.76
onso         -0.74
ply          -0.72
ĵ            -0.72
yes          -0.71
prise        -0.69
serv         -0.69
oning        -0.69
attery       -0.69
Positive Logits
Token        Logit
unfolding    0.92
inside       0.91
behind       0.85
underneath   0.82
unfold       0.81
here         0.79
elsewhere    0.79
inside       0.78
backstage    0.78
upstairs     0.78
Activations Density: 0.033%