INDEX
    Explanations

    questions or statements that ask about actions or situations

    questions about information and understanding

    Negative Logits
    tails       -0.85
    \\\\\\\\    -0.79
    astered     -0.77
    esm         -0.76
    uned        -0.76
    arget       -0.76
    alde        -0.72
    rovers      -0.72
    icked       -0.69
    tra         -0.69
    Positive Logits
     Baz            0.72
     pige           0.71
     calib          0.70
     forgiveness    0.70
     anybody        0.69
     permission     0.66
     possible       0.66
     anyone         0.66
     bothered       0.65
     exactly        0.64
    Act Density 0.085%

    No Known Activations