Explanations
Quotes/opinions
This neuron detects the special header tokens (especially “<|start_header_id|>”) that mark the beginning of an assistant response.
toxic or derogatory statements, especially hate speech targeting identity groups or prompts requesting such content.
Negative Logits
ALERT       -0.06
Rosen       -0.06
Hyde        -0.06
뿐          -0.06
01          -0.06
ircon       -0.06
vượt        -0.06
Icelandic   -0.06
روند        -0.06
员          -0.06
Positive Logits
'name       0.08
=true       0.07
vg          0.07
저          0.07
Omega       0.07
,同时       0.06
중에        0.06
یا          0.06
renamed     0.06
[array      0.06
Activations Density 0.013%