INDEX
    Explanations

    contractions with "n't"

    negations or terms expressing disagreement

    Negative Logits
    "don"              -0.71
    "rog"              -0.69
    "çļ"               -0.68
    "iers"             -0.68
    "inav"             -0.65
    "èĪ"               -0.64
    "antine"           -0.63
    "cano"             -0.62
    "inen"             -0.62
    " Publications"    -0.61

    Positive Logits
    " exactly"          1.12
    " necessarily"      1.10
    " gonna"            1.01
    " quite"            0.97
    " supposed"         0.87
    " really"           0.86
    "epad"              0.85
    " kidding"          0.84
    "icable"            0.84
    " bothering"        0.82

    Act Density 0.077%

    No Known Activations