INDEX
    Explanations

    concepts related to mathematical or logical reasoning

    New Auto-Interp
    Negative Logits
     --         -0.91
     --↵        -0.68
                -0.63
     ---        -0.60
     -          -0.57
     âĢķ        -0.56
     â          -0.52
     --↵↵       -0.52
     Â          -0.51
     âĶĢ        -0.48
    Positive Logits
    "—          0.27
    —is         0.24
    —but        0.24
    ">-->↵      0.24
    —are        0.23
    )—          0.23
    —"          0.23
    —which      0.23
    ”—          0.23
    —that       0.22
    Act Density 0.661%

    No Known Activations