    Explanations

    phrases that express skepticism or critique about common beliefs and notions

    Negative Logits
    Token              Logit
    "154"              -0.15
    "iards"            -0.15
    "109"              -0.15
    "care"             -0.14
    "320"              -0.14
    (whitespace)       -0.14
    "ARRANT"           -0.14
    "fc"               -0.14
    "909"              -0.14
    "iot"              -0.14

    Positive Logits
    Token              Logit
    "edy"              0.18
    "udy"              0.17
    "ewis"             0.16
    "olini"            0.16
    "erras"            0.15
    ".experimental"    0.15
    "agini"            0.14
    "enthal"           0.14
    "VF"               0.14
    " obvious"         0.14

    Activation Density: 0.122%

    No Known Activations