INDEX
    Explanations

    instances of questions or rhetorical questions

    Negative Logits
    Token       Logit
    "hya"       -0.16
    "alar"      -0.14
    "oplay"     -0.14
    "اگ"        -0.14
    "ĵ¨"        -0.14
    "oine"      -0.14
    "ancy"      -0.14
    "leys"      -0.13
    "cke"       -0.13
    "ih"        -0.13
    Positive Logits
    Token       Logit
    " well"     0.21
    " Glad"     0.21
    " Well"     0.19
    " simple"   0.18
    " answer"   0.18
    "Well"      0.17
    " Answer"   0.17
    " nothing"  0.17
    " exactly"  0.17
    " simply"   0.16
    Activation Density: 0.128%

    No Known Activations