INDEX
    Explanations

    words indicating a contradiction or alternative viewpoint

    contrasting phrases that introduce shifts in arguments

    Negative Logits
    Token           Logit
    "meta"          -0.66
    "zero"          -0.62
    "ILLE"          -0.61
    "pert"          -0.60
    "emn"           -0.59
    "imgur"         -0.57
    "ogg"           -0.57
    "SI"            -0.57
    "coat"          -0.56
    "ruit"          -0.56
    Positive Logits
    Token             Logit
    " rather"         1.90
    "rather"          1.50
    " instead"        1.39
    " Rather"         1.31
    " merely"         1.19
    "Rather"          1.12
    "instead"         1.07
    "Instead"         1.06
    " nevertheless"   1.04
    " Instead"        1.02
    Activation Density: 0.077%

    No Known Activations