INDEX
    Explanations

    the letter 't' appearing with high activations

    instances of the negation "didn't."

    Negative Logits
        Token         Logit
        " Reduced"    -0.64
        " Compar"     -0.61
        "itiz"        -0.61
        " arsen"      -0.60
        "lining"      -0.60
        "edIn"        -0.59
        "Pop"         -0.59
        "soType"      -0.59
        " Strongh"    -0.58
        " Dise"       -0.58
    Positive Logits
        Token             Logit
        " bother"         0.99
        " hesitate"       0.91
        " necessarily"    0.89
        " hesitated"      0.77
        " exactly"        0.75
        "apest"           0.75
        "hes"             0.75
        "actic"           0.75
        " dare"           0.74
        "ificate"         0.74
    Act Density 0.059%
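
    As a rough illustration of how a density figure like this could be obtained, the sketch below assumes "Act Density" means the fraction of sampled token positions on which the feature's activation is above zero. That definition, the function name activation_density, and the sample values are assumptions made for illustration and are not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: density is counted per token position, with "active" meaning
    strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 6 of them with nonzero activation.
sample = [0.0] * 9994 + [1.3, 0.2, 4.1, 0.7, 2.2, 0.9]
density = activation_density(sample)

# Format as a percentage with three decimals, mirroring the dashboard's style.
print(f"Act Density {density:.3%}")
```

    A density this low means the feature is silent on almost every token, which fits a feature tied to a narrow context such as a specific negation.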

    No Known Activations