    Explanations

    words indicating contrast or contradiction

    Negative Logits
    xca         -0.15
     bilm       -0.15
    zin         -0.15
    abar        -0.14
    grab        -0.14
    iswa        -0.14
    .jackson    -0.14
     Niet       -0.13
     Barton     -0.13
    .sky        -0.13
    Positive Logits
    DBC         0.16
    åħĦå¼Ł      0.15
     åĽ         0.15
    daq         0.15
     erw        0.15
     SOS        0.14
    oce         0.14
     Kraft      0.14
    orks        0.14
    ekler       0.13
    Activation Density: 0.028%

    No Known Activations