INDEX
    Explanations

    This neuron spots the assistant’s self-descriptive policy-and-safety statements, especially “I cannot/absolutely cannot” refusals based on its programming constraints.

    New Auto-Interp
    Negative Logits
     würde
    0.38
     või
    0.37
    0.36
     algum
    0.36
    would
    0.35
    もしくは
    0.35
     poderia
    0.34
     ataupun
    0.34
    就可以
    0.33
     would
    0.33
    POSITIVE LOGITS
    あくまで
    0.54
    0.36
     narzęd
    0.35
     teknoloj
    0.33
     computadora
    0.33
    TECHN
    0.33
     fundamentally
    0.33
     proizvoda
    0.33
    기술
    0.32
     inanimate
    0.32
    Act Density 0.579%

    No Known Activations