INDEX
    Explanations

    negations or phrases indicating refusal

    Negative Logits
    Token            Logit
    "unden"          -0.16
    "æ¨"             -0.15
    "inand"          -0.14
    "forgettable"    -0.14
    "incr"           -0.14
    "chner"          -0.14
    " Ñģаме"         -0.13
    ",eg"            -0.13
    "lain"           -0.13
    "ibri"           -0.13
    Positive Logits
    Token            Logit
    " sure"          0.32
    "sure"           0.30
    " nearly"        0.24
    " Sure"          0.23
    "Sure"           0.23
    " alone"         0.21
    " anymore"       0.21
    " necessarily"   0.20
    " Nearly"        0.19
    "alone"          0.19
    Activation Density: 0.127%

    No Known Activations