Explanations
specific natural phrasing
tutorial-style, step-by-step explanations with structured lists and embedded code snippets, often around chat turn markers and explanatory breakdowns.
Negative Logits
BROWN    0.43
δικ      0.42
erine    0.41
开始     0.40
는       0.40
yên      0.39
မ        0.39
要想     0.38
nổi      0.37
ρει      0.37
Positive Logits
staging       0.45
anstalt       0.43
fog           0.42
slap          0.40
gruesome      0.40
fake          0.40
follicles     0.39
ernacle       0.39
relentlessly  0.39
gifs          0.38
Activations Density: 22.870%