INDEX
    NEGATIVE LOGITS
    Token         Logit
    "ILLE"        -0.82
    "urses"       -0.68
    "abama"       -0.65
    "ITIES"       -0.65
    "scill"       -0.63
    "imation"     -0.61
    "izabeth"     -0.61
    "udder"       -0.61
    "ught"        -0.61
    "ITY"         -0.60
    POSITIVE LOGITS
    Token            Logit
    " Twain"         1.25
    "eting"          1.10
    "eters"          1.09
    " Zuckerberg"    1.08
    "ipl"            0.99
    "down"           0.98
    "manship"        0.96
    "erness"         0.92
    "owitz"          0.91
    "edly"           0.91
    Activation Density: 0.642%

    No Known Activations