    Explanations

    negative phrases or words that highlight contradictions or issues

    Negative Logits
    ,       -0.34
    's      -0.23
     -      -0.21
    `s      -0.19
    ,↵      -0.19
            -0.18
    &apos   -0.18
    ’s      -0.17
    ?s      -0.17
    �s      -0.17
    Positive Logits
    are     0.18
    )ìĿĢ    0.14
    came    0.14
    were    0.14
    has     0.14
    was     0.14
    leta    0.13
    ÈĻi     0.13
    gle     0.13
    ogn     0.13
    Act Density 0.297%

    No Known Activations