INDEX
    Explanations

    the word "nothing" followed by high activations

    negative assertions and phrases emphasizing nullity or insignificance

    Negative Logits
    Token              Logit
    "PLA"              -0.67
    "landers"          -0.63
    "uctions"          -0.59
    "eus"              -0.59
    " Bots"            -0.58
    " transitions"     -0.58
    " decline"         -0.57
    "eton"             -0.57
    " prohibitions"    -0.57
    " downs"           -0.57
    Positive Logits
    Token              Logit
    "lled"             0.86
    "avering"          0.75
    "bered"            0.74
    "umbn"             0.73
    "arily"            0.73
    "ient"             0.72
    "ĸļ"               0.69
    " akin"            0.69
    "itter"            0.68
    "ozyg"             0.68
    Activation Density: 0.100%

    No Known Activations