INDEX
    Explanations

    expressions of surprise or disbelief

    Negative Logits
        Token             Logit
        " segreg"         -0.67
        " elim"           -0.66
        " Lesbian"        -0.63
        " Personality"    -0.62
        " Luther"         -0.61
        " Feld"           -0.61
        " Liberia"        -0.60
        " Townsend"       -0.60
        " Spa"            -0.60
        " Pixie"          -0.58
    Positive Logits
        Token             Logit
        "esome"            1.57
        "akening"          1.28
        "kward"            1.16
        "alls"             1.03
        "ards"             1.00
        "iring"            0.95
        "ake"              0.92
        "orks"             0.91
        "aw"               0.91
        "reck"             0.90
    Activation Density: 0.009%

    No Known Activations