INDEX
    Explanations

    code comments with following action

    Negative Logits
    Token       Logit
    `           1.20
    (missing)   1.13
    `.          1.07
    `,          1.04
    ↵↵          0.98
    ”.          0.98
    :           0.97
    (missing)   0.95
    `:          0.93
    `).         0.92
    Positive Logits
    Token           Logit
     ---------      1.68
     ------         1.66
     ----           1.64
     -----          1.64
     ======         1.64
     -------        1.60
     ----------     1.55
     =====          1.53
     --------       1.51
     -----------    1.50
    Act Density 0.167%

    No Known Activations