    Explanations

    expressions of confusion or frustration

    Negative Logits
    Oops      -0.16
    oops      -0.16
    drv       -0.16
     Beard    -0.15
    Crud      -0.15
    fait      -0.15
     åĵ       -0.14
    嗯        -0.14
    pron      -0.14
     Hmm      -0.14
    Positive Logits
     why          0.33
     Why          0.26
     seriously    0.24
    why           0.24
     WHY          0.23
     surely       0.22
     how          0.22
    Seriously     0.21
    Why           0.20
    为什么        0.19
    Act Density 0.310%

    No Known Activations