INDEX
    Explanations

    question and answer formats in the text

    Negative Logits
    " outweigh"    -0.70
    "atten"        -0.68
    " outwe"       -0.68
    "guards"       -0.68
    "ords"         -0.66
    " coales"      -0.66
    " comple"      -0.64
    "oval"         -0.63
    " bloom"       -0.61
    "umblr"        -0.60
    Positive Logits
    " Why"          0.97
    " WHY"          0.94
    " What"         0.92
    " Explain"      0.92
    "Why"           0.87
    "Hi"            0.83
    " Hello"        0.83
    "Hello"         0.80
    " How"          0.79
    " Suppose"      0.79
    Activation Density: 0.054%

    No Known Activations