INDEX
    Explanations

    words related to surprise or unexpected outcomes

    Negative Logits
    "Autoritní"  -0.72
    " weird"     -0.64
    "Nero"       -0.63
    " bizarre"   -0.63
    ".*")]"      -0.62
    "Controllo"  -0.62
    " crazy"     -0.62
    "weird"      -0.61
    " Kat"       -0.60
    Positive Logits
    " Surprise"         0.88
    "***************/"  0.81
    "Surprise"          0.81
    " surprise"         0.74
    " surprises"        0.72
    "surprise"          0.69
    "prises"            0.68
    " Suf"              0.68
    "AddTagHelper"      0.68
    " XCTest"           0.68
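    Logit lists like the two above are often produced by projecting the feature's decoder direction onto each token's unembedding vector, then keeping the k lowest and k highest scores. The sketch below illustrates that idea. It assumes the unembedding is stored as one vector per vocabulary token and ignores any normalization a real dashboard may apply. The function name `logit_effects` and the toy vectors are hypothetical and are not taken from this page.

```python
import heapq

def logit_effects(decoder_dir, unembed, vocab, k=3):
    """Score each token by the dot product of the feature direction with its unembedding.

    Assumption (hypothetical layout): unembed[i] is the d_model-sized
    unembedding vector for vocab[i]. Returns (most_negative, most_positive),
    each a list of (token, score) pairs.
    """
    scores = [
        (tok, sum(d * u for d, u in zip(decoder_dir, col)))
        for tok, col in zip(vocab, unembed)
    ]
    # Tokens whose logits the feature pushes down the most.
    most_negative = heapq.nsmallest(k, scores, key=lambda ts: ts[1])
    # Tokens whose logits the feature pushes up the most.
    most_positive = heapq.nlargest(k, scores, key=lambda ts: ts[1])
    return most_negative, most_positive

# Toy two-dimensional example with made-up vectors (not real model weights).
vocab = [" surprise", " weird", "Kat", " crazy"]
unembed = [[0.75, 0.0], [-0.25, 0.5], [0.125, 0.25], [-0.5, 0.25]]
neg, pos = logit_effects([1.0, -0.5], unembed, vocab, k=2)
print(neg)  # [(' crazy', -0.625), (' weird', -0.5)]
print(pos)  # [(' surprise', 0.75), ('Kat', 0.0)]
```

    The toy values are powers of two, so the printed scores are exact.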
    Activation Density 0.010%

    No Known Activations