INDEX
    Explanations

    mentions of the word "well" followed by a number rating

    Negative Logits
    Token            Logit
    "ategory"        -0.62
    "TAG"            -0.59
    " generated"     -0.58
    " torn"          -0.58
    "Ts"             -0.58
    " glam"          -0.58
    " finance"       -0.57
    " style"         -0.57
    " pric"          -0.57
    " generation"    -0.56
    Positive Logits
    Token            Logit
    "well"            4.61
    "Well"            1.48
    " well"           1.32
    " Well"           1.27
    "hey"             1.19
    "way"             1.17
    "wait"            1.11
    "we"              1.06
    "worth"           1.02
    "stable"          1.01
    Activation Density: 0.010%
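
To put the density figure in concrete terms: 0.010% corresponds to roughly one active token in every 10,000. The sketch below assumes density is defined as the fraction of sampled tokens on which the feature's activation exceeds a threshold. The function name `activation_density`, the zero threshold, and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Illustrative sample: the feature fires on 1 token out of 10,000.
acts = [0.0] * 9_999 + [3.2]
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.010%
```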

    No Known Activations