    Explanations

    phrases that indicate doubt or contradiction

    Negative Logits
    Token          Logit
    " still"       -0.17
    " Still"       -0.16
    "still"        -0.16
    " STILL"       -0.15
    " плÑİ"        -0.15
    "_contin"      -0.14
    "Still"        -0.14
    "å°ļ"          -0.14
    "onto"         -0.13
    "serialize"    -0.13

    Positive Logits
    Token          Logit
    " WRONG"        0.40
    " Wrong"        0.39
    " reality"      0.36
    "Wrong"         0.35
    " well"         0.35
    " wrong"        0.34
    "well"          0.32
    " Reality"      0.32
    " Well"         0.32
    " WELL"         0.30
    Activation Density: 0.220%

    No Known Activations