INDEX
    Explanations

    phrases expressing refusal or opposition

    negations and expressions of refusal

    Negative Logits
     nonetheless        -0.71
     nevertheless       -0.70
    ゼ                  -0.69
    ーテ                -0.66
     unmist             -0.65
    ワン                -0.64
     invariably         -0.64
     swiftly            -0.64
    senal               -0.63
     simultaneously     -0.62
    Positive Logits
    hin                  1.34
     fuckin              1.14
     wanna               1.11
     deserve             1.09
     fucking             1.00
     gonna               0.95
     belong              0.92
     condone             0.92
     exist               0.91
     EVEN                0.90
    Act Density 0.265%
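
A minimal sketch of how a figure like the activation density above could be computed, assuming it means the percentage of token positions on which the feature has a nonzero activation. That definition is an assumption, as are the function name `activation_density`, the `threshold` parameter, and the sample data, none of which come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds threshold.

    Assumed definition: density = 100 * (active positions) / (all positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 2,000 token positions, 5 of them with a nonzero activation.
sample = [0.0] * 1995 + [0.4, 1.2, 0.1, 2.3, 0.7]
density = activation_density(sample)
print(f"Act Density {density:.3f}%")
```

A density well under 1% would indicate a sparse feature that fires on only a small fraction of tokens.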

    No Known Activations