    Explanations

    words indicating contrast

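    Negative logits (score, token; quotes mark leading spaces)
        0.43  "chnen"
        0.39  "데요" (Korean polite sentence ending)
        0.38  " zentral" (German, "central")
        0.38  " réc"
        0.37  " inder"
        0.36  " dunque" (Italian, "so, therefore")
        0.36  "hension"
        0.35  "कश"
        0.35  "agata"
        0.35  " 그러니까" (Korean, "so, I mean")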
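    Positive logits (score, token; quotes mark leading spaces)
        3.30  " despite"
        2.94  "despite"
        2.78  " Despite"
        2.78  " apesar" (Portuguese, "despite")
        2.75  " несмотря" (Russian, "despite")
        2.72  "Despite"
        2.67  "尽管" (Chinese, "although, despite")
        2.66  " although"
        2.64  "虽然" (Chinese, "although")
        2.64  " meskipun" (Indonesian, "although")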
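    The page does not say how these logit scores were computed. One common approach, assumed here rather than confirmed by the source, scores each vocabulary token by the dot product of the feature's decoder direction with that token's column of the model's unembedding matrix, then lists the highest-scoring and lowest-scoring tokens. This is a minimal sketch on toy numbers; `logit_effects`, `top_and_bottom`, and every input value are hypothetical.

```python
# Hypothetical sketch: rank vocabulary tokens by a feature's direct logit effect.
# Assumption (not stated by the source page): the score for token t is the dot
# product of the feature's decoder direction (length d_model) with column t of
# the unembedding matrix W_U (shape d_model x vocab).

def logit_effects(decoder_dir, W_U):
    """Return one score per vocabulary token: sum_i decoder_dir[i] * W_U[i][t]."""
    d_model = len(decoder_dir)
    vocab = len(W_U[0])
    return [sum(decoder_dir[i] * W_U[i][t] for i in range(d_model)) for t in range(vocab)]


def top_and_bottom(scores, tokens, k):
    """Return the k highest-scoring (token, score) pairs and the k lowest,
    the latter ordered from most negative upward."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k], ranked[-k:][::-1]


if __name__ == "__main__":
    # Toy model: d_model = 2, vocabulary of 4 tokens (values are made up).
    tokens = [" despite", " although", " zentral", "chnen"]
    decoder_dir = [1.0, -0.5]
    W_U = [[2.0, 1.5, 0.25, -0.25],
           [-1.0, -1.0, 0.5, 0.5]]
    pos, neg = top_and_bottom(logit_effects(decoder_dir, W_U), tokens, 2)
    print(pos)  # prints [(' despite', 2.5), (' although', 2.0)]
    print(neg)  # prints [('chnen', -0.5), (' zentral', 0.0)]
```

    Scoring through the unembedding ignores everything downstream of the feature (later layers, the final layer norm), so such lists describe the feature's direct effect on the output only.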
    Activation density: 0.170%
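    An activation density of 0.170% corresponds to roughly 17 of every 10,000 tokens, or about 1 token in 590. The page does not define density; assuming it is the percentage of token positions where the feature's activation is above zero, it can be computed as in this sketch. The function `activation_density` and its `threshold` parameter are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose feature activation exceeds `threshold`.

    Assumption: a position counts as active when activation > threshold.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    # Toy data: 17 active positions out of 10,000.
    acts = [0.0] * 9_983 + [1.2] * 17
    print(f"{activation_density(acts):.3f}%")  # prints 0.170%
```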

    No Known Activations