INDEX
    Explanations

    references to hypocrisy and double standards in behavior or beliefs

    Negative Logits
     cannot
    -0.20
     Cannot
    -0.19
    cannot
    -0.17
    Cannot
    -0.16
     أيضا
    -0.15
    orig
    -0.15
     is
    -0.14
     Dont
    -0.14
     am
    -0.14
    イント
    -0.14
    Positive Logits
    're
    0.47
    've
    0.42
    'll
    0.42
    ’re
    0.40
    'd
    0.37
    'm
    0.36
    ’ll
    0.35
    ’ve
    0.35
    ’d
    0.30
    ’m
    0.29
    Act Density 1.011%

    No Known Activations