    Explanations

    negations and contrasts in text

    Negative Logits
    "çļ"        -0.81
    "kamp"      -0.76
    "å¥"        -0.75
    "çīĪ"       -0.72
    "cano"      -0.70
    "USH"       -0.68
    "velt"      -0.68
    "unders"    -0.66
    "ongs"      -0.64
    "ously"     -0.64
    Positive Logits
    " necessarily"    1.42
    "icable"          1.23
    "icably"          1.11
    " exactly"        1.07
    "eworthy"         1.03
    "withstanding"    0.98
    "orious"          0.97
    " entirely"       0.97
    " uncommon"       0.96
    "epad"            0.95
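    Logit lists like these are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix and reading off the highest- and lowest-scoring tokens. The page does not state how these particular values were computed, so the sketch below assumes that approach; `decoder_dir`, `unembed`, and the four-token `vocab` are hypothetical toy inputs, not the real model weights.

```python
def logit_effects(decoder_dir, unembed):
    """Project a decoder direction (length d_model) through an unembedding
    matrix stored as d_model rows of d_vocab entries (assumed layout)."""
    d_vocab = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        for j in range(d_vocab)
    ]


def top_and_bottom(effects, vocab, k):
    """Return the k most positive and k most negative (token, effect) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    bottom = ranked[:k]                # most negative first
    top = list(reversed(ranked[-k:]))  # most positive first
    return top, bottom


# Hypothetical toy example: d_model = 2, vocabulary of 4 tokens.
vocab = [" necessarily", "kamp", " exactly", "velt"]
unembed = [
    [1.0, -0.5, 0.8, -0.3],
    [0.2, -0.4, 0.1, -0.1],
]
decoder_dir = [1.0, 1.0]

effects = logit_effects(decoder_dir, unembed)
top, bottom = top_and_bottom(effects, vocab, k=2)
print("positive:", top)
print("negative:", bottom)
```

    Note that tokens keep their leading space (" necessarily" vs "necessarily"), since byte-pair vocabularies treat them as distinct entries.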
    Activation Density: 0.548%
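    Activation density is the share of tokens on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly greater than zero over some sample of tokens (the page states neither the threshold nor the sample size); `activations` and `sample` are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold` (assumed 0)."""
    if not activations:
        raise ValueError("need at least one token activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Hypothetical sample: 1,000 tokens, of which 5 have a positive activation.
sample = [0.0] * 995 + [1.3, 0.4, 2.1, 0.7, 0.9]
print(f"{activation_density(sample):.3f}%")  # 0.500%
```

    At the reported 0.548%, the feature would fire on roughly 5 to 6 of every 1,000 tokens.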

    No Known Activations