INDEX
    Explanations

    specific natural phrasing

    tutorial-style, step-by-step explanations with structured lists and embedded code snippets, often around chat turn markers and explanatory breakdowns.

    Negative Logits
     BROWN            0.43
    δικ               0.42
     erine            0.41
    开始              0.40
    (no token shown)  0.40
     yên              0.39
    (no token shown)  0.39
    要想              0.38
     nổi              0.37
    ρει               0.37
    Positive Logits
     staging          0.45
    anstalt           0.43
    fog               0.42
     slap             0.40
     gruesome         0.40
     fake             0.40
     follicles        0.39
    ernacle           0.39
     relentlessly     0.39
    gifs              0.38
    Act Density 22.870%

    No Known Activations