INDEX
    Explanations

    contradictions or contrasts in statements

    Negative Logits
    Token      Logit
    "chter"    -0.15
    "ichen"    -0.15
    "irts"     -0.15
    "argar"    -0.15
    "τζ"       -0.14
    "amet"     -0.14
    "rán"      -0.14
    "pub"      -0.13
    "ug"       -0.13
    "egal"     -0.13
    Positive Logits
    Token            Logit
    " actually"      0.28
    " Actually"      0.24
    "actually"       0.22
    "Actually"       0.22
    "其实"           0.18
    " Nope"          0.16
    "ensa"           0.16
    "xFFF"           0.16
    " eigentlich"    0.15
    "ething"         0.15
    Activation Density: 0.163%

    No Known Activations
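
    Dashboards of this kind often show vocabulary entries in raw byte-level BPE form, so a multi-byte token such as 其实 (Chinese for "actually") can appear as `åħ¶å®ŀ`. The sketch below reverses that rendering. It assumes the underlying tokenizer uses the GPT-2 style byte-to-character table, which this page does not state; `decode_token` is a hypothetical helper name, and it should only be applied to tokens that are still in byte-level form, not to ones already shown as readable text.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild the GPT-2 style byte -> printable-character table.

    Assumption: the tokenizer behind the dashboard uses this table.
    Printable bytes map to themselves; all others are shifted to
    code points starting at 256.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a byte-level token string into readable text."""
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Characters outside the table (e.g. a literal space) pass through.
            raw.extend(ch.encode("utf-8"))
    # Byte sequences that are not valid UTF-8 become U+FFFD.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["åħ¶å®ŀ", "ÏĦζ", "Ġactually"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

    Under that assumption, the two byte-level entries in the logit lists decode to 其实 and τζ, and a leading `Ġ` decodes to a space.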