INDEX
    Explanations

    phrases indicating personal identity or self-description

    Negative Logits
    ences                 -0.16
    ison                  -0.16
    enci                  -0.16
    ako                   -0.15
    enia                  -0.14
    ence                  -0.14
    enson                 -0.14
    HQ                    -0.14
     Dav                  -0.14
    hq                    -0.14
    Positive Logits
    lix                   0.15
    ÑĦÑĸк                 0.15
    _________________↵↵   0.15
    커ìĬ¤                  0.14
    èĢ                    0.14
    CCR                   0.14
    .cls                  0.14
    inish                 0.14
    (PyObject             0.14
    auc                   0.14
    Activation Density: 0.005%

    No Known Activations