    Explanations

    question answering

    This neuron activates on self-introductory disclaimers, particularly the phrase “As an AI language model.”

    Negative Logits
    Token          Logit
    "iVar"         -0.07
    "agree"        -0.07
    " cade"        -0.07
    " TRE"         -0.07
    "question"     -0.07
    " Gast"        -0.07
    "协议"         -0.06
    "ощи"          -0.06
    "าณ"           -0.06
    " observes"    -0.06
    Positive Logits
    Token            Logit
    (blank)          0.07
    "[new"           0.06
    " celý"          0.06
    " Camping"       0.06
    " Measurement"   0.06
    " gây"           0.06
    " YYSTYPE"       0.06
    "*-"             0.06
    "μό"             0.06
    " hoje"          0.06
    Act Density 0.042%
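
For readers unfamiliar with the statistic, the sketch below shows one common way an activation density figure like this could be computed: the fraction of token positions on which the neuron's activation exceeds a threshold. The function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the page does not state how its own figure was derived.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron fires;
    the dashboard itself does not specify the definition or the threshold.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data, not taken from the page: 100,000 token positions,
# 42 of which have a non-zero activation.
sample = [0.0] * 99_958 + [1.0] * 42
density = activation_density(sample)
print(f"Act Density {density:.3%}")
```

A density this low would mean the neuron is silent on almost every token, which fits a unit that responds to one narrow phrase.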

    No Known Activations