INDEX
    Explanations

    Instances of the word "not".

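    Negative Logits
    (quotes mark token boundaries; a leading space is part of the token)
    Token             Logit
    "lical"           -0.15
    "ilerden"         -0.15
    "781"             -0.15
    "olated"          -0.15
    "halt"            -0.15
    "ymous"           -0.14
    "μά"              -0.14
    " navr"           -0.14
    "hetto"           -0.14
    "ussy"            -0.13

    Positive Logits
    Token             Logit
    " everyone"       0.28
    " everybody"      0.25
    " knowing"        0.23
    "everyone"        0.23
    "ori"             0.23
    "ches"            0.22
    " everything"     0.20
    " only"           0.20
    " surprisingly"   0.20
    "tingham"         0.19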
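    The logit lists above rank vocabulary tokens by how strongly this feature pushes their logits down or up. The sketch below shows one plausible way to produce such lists. It assumes a per-token effect vector has already been computed (for example, by projecting the feature's decoder direction through the model's unembedding). The function name `top_logits` and the small `effects` dictionary are hypothetical; the values are copied from the lists above, not the full vocabulary.

```python
# Hypothetical sketch: given a precomputed per-token logit effect for one
# feature, keep the k most negative and k most positive entries.
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def top_logits(effects: Dict[str, float], k: int) -> Tuple[List[Pair], List[Pair]]:
    """Return (most_negative, most_positive) as lists of (token, effect)."""
    ranked = sorted(effects.items(), key=lambda item: item[1])  # ascending by effect
    most_negative = ranked[:k]        # smallest (most negative) effects first
    most_positive = ranked[::-1][:k]  # largest (most positive) effects first
    return most_negative, most_positive


# Illustrative subset of values from the lists above (assumed, not exhaustive).
effects = {
    " everyone": 0.28,
    " everybody": 0.25,
    " only": 0.20,
    "lical": -0.15,
    "ymous": -0.14,
    "ussy": -0.13,
}
neg, pos = top_logits(effects, 2)
print(neg)  # [('lical', -0.15), ('ymous', -0.14)]
print(pos)  # [(' everyone', 0.28), (' everybody', 0.25)]
```

    Sorting once and slicing both ends keeps the example short; for a real vocabulary of tens of thousands of tokens, a partial selection (such as a heap) would avoid the full sort.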
    Activation Density: 0.037%
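    The activation density (Act Density) figure can be read as the share of token positions on which the feature fires. The sketch below assumes that definition, with "fires" meaning an activation strictly above a threshold of zero. The function name `activation_density` and the synthetic activation list are hypothetical and only illustrate how a value such as 0.037% arises.

```python
# Hypothetical sketch, assuming density = fraction of token positions whose
# feature activation exceeds a threshold (0.0 by default).
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of positions where the activation is strictly above threshold."""
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data: 37 active positions out of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(37):
    acts[i * 1000] = 1.5

print(f"{activation_density(acts):.3%}")  # 0.037%
```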

    No Known Activations