INDEX
    Explanations

    comparative phrases indicating differences

    comparisons between entities or concepts

    Negative Logits
    "gro"           -0.54
    " (>"           -0.53
    "ustain"        -0.53
    "spr"           -0.53
    "itans"         -0.53
    "ources"        -0.52
    " Deadline"     -0.51
    "ioch"          -0.51
    "aturday"       -0.51
    "ilda"          -0.51
    Positive Logits
    " differently"    2.03
    " different"      1.86
    "different"       1.74
    " similar"        1.53
    " opposite"       1.46
    " Different"      1.43
    " worse"          1.35
    " identical"      1.33
    " similarly"      1.31
    " simpler"        1.30
    Act Density 0.970%

    No Known Activations