INDEX
    Explanations

    This neuron activates on “jailbreak”-style prompt tags (e.g. tokens forming “[JAILBREAK]” or “[CLASSIC]”).

    Negative Logits
    Token          Logit
    " вывод"       -0.07
    "(Un"          -0.07
    "418"          -0.06
    " Sunset"      -0.06
    " Sundays"     -0.06
    " Hammond"     -0.06
    " caracter"    -0.06
    " kata"        -0.06
    "โดย"          -0.06
    "(B"           -0.06
    Positive Logits
    Token              Logit
    "	help"            0.07
    " přibliž"         0.07
    " seviy"           0.06
    " فراهم"           0.06
    "ちゃ"              0.06
    " kafka"           0.06
    " getKey"          0.06
    "ừng"              0.06
    ".showMessage"     0.06
    (missing)          0.06
    Act Density 0.003%
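
    The page does not say how the activation density figure is computed. A common reading is the percentage of sampled tokens on which the neuron's activation exceeds zero; the minimal sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and only illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    neuron fires at all. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 3 active tokens out of 100,000 sampled tokens.
sample = [0.0] * 99_997 + [1.2, 0.4, 2.5]
print(f"{activation_density(sample):.3f}%")  # prints "0.003%"
```

    Under this reading, a density of 0.003% means the neuron fires on roughly 3 tokens in every 100,000 sampled.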

    No Known Activations