INDEX
    Explanations

    Speaking and reacting

    This neuron detects language where the user is trying to coerce or manipulate the assistant into producing harmful or disallowed content.

    Negative Logits
    Token               Logit
    " отмет"            -0.07
    (blank)             -0.07
    "ную"               -0.07
    "ный"               -0.07
    " dol"              -0.07
    "-tests"            -0.06
    "\t        \t"      -0.06
    " نوفمبر"           -0.06
    "ном"               -0.06
    " cat"              -0.06
    Positive Logits
    Token               Logit
    " unequiv"          0.07
    "\topt"             0.06
    "(`↵"               0.06
    ".asp"              0.06
    "人口"              0.06
    "лаж"               0.06
    "Nike"              0.06
    "\tonClick"         0.06
    " goodwill"         0.06
    ".erb"              0.06
    Activation Density: 0.005%

    No Known Activations