INDEX
    Explanations

    contractions of words, specifically finding instances of "didn't" with a strong activation value

    negative contractions and phrases that express negation

    Negative Logits
    Token          Logit
    "planet"       -0.75
    "amer"         -0.71
    " rall"        -0.64
    "accompan"     -0.64
    "bard"         -0.63
    "stre"         -0.62
    " Britann"     -0.62
    "antine"       -0.62
    "rog"          -0.61
    "Reviewer"     -0.60
    Positive Logits
    Token            Logit
    " necessarily"   1.03
    " exactly"       1.02
    " gonna"         0.95
    " quite"         0.84
    "urtles"         0.82
    " gotta"         0.81
    " kidding"       0.78
    " really"        0.77
    " bother"        0.76
    " even"          0.75
    Activation Density: 0.069%

    No Known Activations