INDEX
    Explanations

    mentions of specific names or topics

    the word "mentioned" and its variations in various contexts

    Negative Logits
    uilt        -0.75
    eware       -0.70
    usterity    -0.70
    earned      -0.69
    otypes      -0.69
    orneys      -0.69
    iership     -0.68
    inals       -0.67
    heres       -0.67
    onew        -0.66
    Positive Logits
     mentioning   1.03
     mentions     0.97
     mentioned    0.84
    lihood        0.80
     aloud        0.80
     mention      0.76
     Prelude      0.71
     above        0.71
    REDACTED      0.68
     prominently  0.68
    Activation Density: 0.010%

    No Known Activations