INDEX
    Explanations

    phrases that contrast viewpoints or actions between different entities

    references to "others" in various contexts

    Negative Logits
    "},"           -0.82
    Opening        -0.69
    Pen            -0.68
    Awesome        -0.65
    Alright        -0.64
    United         -0.64
    SI             -0.63
    Rated          -0.63
     Annotations   -0.62
    RNA            -0.62
    Positive Logits
     merely     1.08
     simply     1.05
     prefer     0.94
     succumb    0.89
     remain     0.80
     cling      0.79
     rely       0.78
     just       0.78
     are        0.77
     opt        0.76
    Activation Density 0.124%

    No Known Activations