INDEX
    Explanations

    instances of self-reflection and expressions of doubt or criticism

    Negative Logits
    Token            Logit
    "Impossible"     -0.17
    "riere"          -0.16
    "rière"          -0.16
    "lech"           -0.15
    "gili"           -0.15
    "YLON"           -0.15
    "/apis"          -0.14
    " Impossible"    -0.14
    "uye"            -0.14
    "krom"           -0.14
    Positive Logits
    Token            Logit
    " na"            0.35
    " mistake"       0.32
    " naive"         0.30
    " naï"           0.28
    " foolish"       0.27
    " mistakes"      0.27
    " folly"         0.26
    "错误"           0.25
    " Na"            0.25
    " mistaken"      0.24
    Activation Density: 0.037%

    No Known Activations