INDEX
    Explanations

    comparisons or preferences between two options, typically favoring one over the other

    comparisons that emphasize preference for one option over another

    NEGATIVE LOGITS
    "minent"   -0.75
    "amba"     -0.75
    "mberg"    -0.75
    "adium"    -0.75
    "eria"     -0.74
    "ruary"    -0.73
    "ppo"      -0.73
    "cision"   -0.72
    "endale"   -0.71
    "elaide"   -0.70
    POSITIVE LOGITS
    " than"           0.78
    " preferring"     0.72
    " unimagin"       0.72
    " pricey"         0.70
    " trivial"        0.68
    " Ide"            0.68
    " inconvenient"   0.66
    " innocuous"      0.66
    " unpleasant"     0.65
    " Leh"            0.65
    Activation Density: 0.015%

    No Known Activations