INDEX
    Explanations

    Mentions of the assistant identifying itself as an AI (self-referential statements about being an AI).

    Negative Logits
    Token           Logit
    " PREFIX"       -0.07
    "luv"           -0.06
    " Belgian"      -0.06
    " Administr"    -0.06
    "der"           -0.06
    "Franc"         -0.06
    (not shown)     -0.06
    "GRE"           -0.06
    "ISC"           -0.06
    " affects"      -0.06
    Positive Logits
    Token           Logit
    " solicit"      0.07
    " AI"           0.06
    "exter"         0.06
    "){↵"           0.06
    " diets"        0.06
    "ysical"        0.06
    "/#{"           0.06
    " McKin"        0.06
    "の子"          0.06
    "()){↵"         0.06
    Activation Density: 0.022%

    No Known Activations