Explanations
The neuron detects the appearance of the “condone / promote” refusal phrasing (i.e. the split tokens cond + one and promote) in the assistant’s disclaimers.
Negative Logits
Escape         -0.07
.Cascade       -0.06
search         -0.06
setObject      -0.06
-community     -0.06
looping        -0.06
Stay           -0.06
"','"          -0.06
guitars        -0.06
-----------    -0.06
Positive Logits
způsob         0.07
gg             0.07
هه             0.07
HEL            0.07
UTH            0.06
дан            0.06
ops            0.06
fov            0.06
대회           0.06
marching       0.06
Activations Density: 0.010%