INDEX
    Explanations

    the word "surprise" or similar variations

    phrases indicating a lack of surprise or expected outcomes

    Negative Logits
    Token         Logit
    "eatures"     -0.84
    "minster"     -0.79
    "chnology"    -0.77
    "İĭ"          -0.76
    "rote"        -0.74
    "adle"        -0.72
    "folios"      -0.71
    "nai"         -0.71
    "rogram"      -0.70
    "ettel"       -0.70
    Positive Logits
    Token            Logit
    " whatsoever"    0.87
    " anymore"       0.82
    " nor"           0.78
    " surprises"     0.67
    " surprise"      0.66
    "imaru"          0.65
    " prompts"       0.64
    " why"           0.64
    " enough"        0.63
    REDACTED         0.63
    Activation Density: 0.021%

    No Known Activations