INDEX
    Explanations

    mathematical symbols and notations used in equations

    Negative Logits
    Token           Logit
    I               -0.40
    (whitespace)    -0.39
    T               -0.38
    ↵↵              -0.38
    (not shown)     -0.37
    C               -0.37
    P               -0.36
    t               -0.35
    L               -0.35
    K               -0.35
    Positive Logits
    Token           Logit
    <unused43>      0.95
    <unused14>      0.95
    [@BOS@]         0.94
    <unused42>      0.94
    <unused41>      0.94
    <unused74>      0.94
    <unused51>      0.94
    <unused28>      0.94
    <unused1>       0.93
    <unused3>       0.93
    Act Density 0.309%

    No Known Activations