INDEX
    Explanations

    instances of the prefix "un" that convey a sense of negation or undesirability

    Negative Logits
    "wards"          -0.15
    "BN"             -0.15
    "hq"             -0.15
    "eval"           -0.15
    "usercontent"    -0.15
    "oir"            -0.14
    "erior"          -0.14
    " sle"           -0.14
    "iske"           -0.14
    "xb"             -0.14

    Positive Logits
    "lesi"            0.18
    "elcome"          0.18
    " Colomb"         0.17
    " desirable"      0.16
    "è"               0.16
    "old"             0.16
    "reck"            0.15
    "ounded"          0.15
    "inky"            0.15
    "inkle"           0.15
    Activation Density: 0.005%

    No Known Activations