Explanations
This neuron activates on “jailbreak”-style prompt tags (e.g. tokens forming “[JAILBREAK]” or “[CLASSIC]”).
Negative Logits
вывод       -0.07
(Un         -0.07
418         -0.07
Sunset      -0.06
Sundays     -0.06
Hammond     -0.06
caracter    -0.06
kata        -0.06
โดย         -0.06
(B          -0.06
Positive Logits
help            0.07
přibliž         0.07
seviy           0.06
فراهم           0.06
ちゃ            0.06
kafka           0.06
getKey          0.06
ừng             0.06
.showMessage    0.06
什              0.06
Activations Density: 0.003%