    Explanations

    Quotes/opinions

    This neuron detects the special header tokens (especially “<|start_header_id|>”) that mark the beginning of an assistant response.

    Toxic or derogatory statements, especially hate speech targeting identity groups, or prompts requesting such content.

    Negative Logits
     ALERT         -0.06
     Rosen         -0.06
     Hyde          -0.06
    (not shown)    -0.06
    01             -0.06
    ircon          -0.06
     vượt          -0.06
     Icelandic     -0.06
     روند          -0.06
    (not shown)    -0.06
    Positive Logits
    'name          0.08
    =true          0.07
    vg             0.07
    (not shown)    0.07
     Omega         0.07
    ,同时           0.06
    중에            0.06
     یا            0.06
     renamed       0.06
    [array         0.06
    Activation Density: 0.013%

    No Known Activations