    Explanations

    The neuron fires on alignment-related keywords (e.g. “aligned,” “alignment”) in code.

    Negative Logits
    Logit   Token
    -0.07   "检测"
    -0.07   "_tel"
    -0.07   ",new"
    -0.06   " tcb"
    -0.06   " Null"
    -0.06   (whitespace)
    -0.06   " Happiness"
    -0.06   " interceptor"
    -0.06   "peace"
    -0.06   (not shown)
    Positive Logits
    Logit   Token
    0.06    "із"
    0.06    "یمی"
    0.06    " grâce"
    0.06    "Adam"
    0.06    "پر"
    0.06    (not shown)
    0.06    " أبي"
    0.06    "illions"
    0.06    (not shown)
    0.06    " ابت"
    Activation Density: 0.001%

    No Known Activations