INDEX
    Explanations

    negations and refusals in the text

    Negative Logits
    Token    Logit
    nty      -0.15
    ジア     -0.15
    ntag     -0.14
    omy      -0.14
    afort    -0.13
    OVE      -0.13
    ایش      -0.13
    awns     -0.12
    ='".     -0.12
    롯       -0.12
    Positive Logits
    Token          Logit
     necessarily   0.41
     mind          0.34
     ever          0.32
     even          0.31
     exactly       0.29
     dare          0.27
     necessary     0.26
     EVER          0.25
     bother        0.25
     anymore       0.25
    Activation Density: 0.170%

    No Known Activations