    Explanations

    phrases indicating clarity or certainty

    instances of the word "obvious."

    Negative Logits (token, logit; quotes show leading spaces)
    "nan"          -0.86
    "rams"         -0.77
    "iership"      -0.74
    " tightly"     -0.66
    " ILCS"        -0.64
    "ingers"       -0.64
    " monitored"   -0.63
    "ander"        -0.63
    "ching"        -0.62
    "uden"         -0.61
    Positive Logits (token, logit; quotes show leading spaces)
    " obvious"     1.03
    "iary"         0.83
    " contrad"     0.73
    " Leilan"      0.73
    "tale"         0.73
    "Ùĩ"           0.72
    " culprit"     0.71
    " signs"       0.70
    "\\\\\\\\"     0.70
    " Signs"       0.69
    Activation density: 0.009%

    No known activations.