    Explanations

    phrases indicating negation or potential misunderstanding

    phrases that express caution or reassurance

    Negative Logits
        Token           Logit
        "ilogy"         -0.76
        "anded"         -0.68
        "alist"         -0.66
        "ially"         -0.64
        "figured"       -0.64
        "azo"           -0.62
        "ranch"         -0.62
        "ettlement"     -0.62
        "atar"          -0.62
        "erial"         -0.61
    Positive Logits
        Token           Logit
        " yourself"      1.05
        " yourselves"    1.02
        " anymore"       0.93
        " Yourself"      0.85
        " ANY"           0.80
        " any"           0.79
        " your"          0.75
        " whining"       0.70
        " anything"      0.70
        " YOUR"          0.68
    Activation Density: 0.123%
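Activation density is usually reported as the percentage of token positions on which a feature fires. A minimal sketch follows, assuming "fires" means an activation strictly above a threshold of zero; the page does not state its exact threshold or the size of the token sample, so both the function name `activation_density` and the 100,000-token toy sample are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation to compute a density")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy sample: the feature is active on 123 of 100,000 token positions.
sample = [0.0] * 99_877 + [0.5] * 123
print(f"{activation_density(sample):.3f}%")
```

Under these assumptions, a density of 0.123% corresponds to roughly 1 active position in every 800 tokens.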

    No Known Activations