    Explanations

    Human behavior

    This neuron activates on the prompt’s “behavior” indicator, i.e. the token introducing the specific behavior to evaluate.

    Negative Logits
    Seconds         -0.07
                    -0.07
    ('.')↵          -0.07
    care            -0.06
     bilingual      -0.06
    puty            -0.06
     "}";↵          -0.06
    .ud             -0.06
    uard            -0.06
    	On             -0.06
    Positive Logits
     avantaj        0.07
    abilidade       0.07
    ("/{            0.06
    /theme          0.06
     деятельности   0.06
     libr           0.06
     نسخ            0.06
    โรงแรม          0.06
     Hệ             0.06
    _mail           0.06
    Activation Density: 0.002%

    No Known Activations