INDEX
    Explanations

    proper names followed by colons

    statements and responses in a conversational or question-and-answer format

    Negative Logits
    " downstream"    -0.69
    " doub"          -0.67
    " wrath"         -0.64
    " forgotten"     -0.63
    " unchecked"     -0.63
    "abad"           -0.63
    " rule"          -0.62
    "foreseen"       -0.62
    " trespass"      -0.61
    " attention"     -0.61

    Positive Logits
    " Exactly"        1.15
    " Yeah"           1.07
    "Absolutely"      0.98
    " Absolutely"     0.95
    " Originally"     0.94
    " Yes"            0.91
    "Yeah"            0.90
    " Provided"       0.88
    " Firstly"        0.86
    " Hmm"            0.86

    Activation Density: 0.030%

    No Known Activations