INDEX
    Explanations

    quotations with attributions

    binary responses or indicators of a conclusion

    Negative Logits
    anwhile   -0.64
    avorite   -0.62
    jri       -0.62
     destro   -0.58
    lished    -0.57
     withd    -0.56
    emale     -0.54
     rall     -0.53
    etheless  -0.52
    essage    -0.52
    Positive Logits
    ")    1.03
    "]    1.02
    "—    1.01
    ,"    0.97
    ,'"   0.95
    "),   0.95
    %"    0.95
    .")   0.94
    ":    0.93
    "?    0.93
    Activation Density: 0.463%

    No Known Activations