    Explanations

    phrases indicating surprise or unexpectedness

    Negative Logits
    Token           Logit
    "enderror"      -0.15
    " incons"       -0.14
    "IXEL"          -0.13
    " Schwe"        -0.13
    "aight"         -0.13
    "_sensitive"    -0.13
    "ewire"         -0.13
    "اتÙĩ"          -0.13
    "bsub"          -0.13
    "_FAULT"        -0.13
    Positive Logits
    Token           Logit
    " surprise"     0.88
    " surprises"    0.77
    " Surprise"     0.76
    " surprised"    0.68
    " surpr"        0.66
    " surprising"   0.60
    "sur"           0.59
    " Sur"          0.57
    "-sur"          0.56
    " unexpected"   0.55
    Activation Density: 0.301%

    No Known Activations
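
    As a rough illustration of how to read the density figure reported above, the sketch below assumes that activation density means the fraction of token positions on which the feature fires above a threshold. That definition, the function name `activation_density`, and the sample data are assumptions for illustration; they are not taken from this page or from any specific tool's implementation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: the page itself does not say how density is computed.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 3 of which activate the feature.
sample = [0.0] * 997 + [0.4, 1.2, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

    Under this reading, a density of 0.301% would mean the feature fires on roughly 3 of every 1000 tokens, which fits a feature narrowly tied to "surprise" vocabulary.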