INDEX
Explanations
The neuron mainly picks up on positive evaluative adjectives in the assistant’s self-descriptions—especially “helpful” (and related words like “accurate” or “safe”).
New Auto-Interp
Negative Logits
dependencies
-0.07
cro
-0.07
.service
-0.07
@app
-0.07
returning
-0.06
CX
-0.06
("(%-0.06
leaks
-0.06
kèm
-0.06
vtx
-0.06
POSITIVE LOGITS
apenas
0.07
가지고
0.06
_IM
0.06
utilized
0.06
polygons
0.06
زارش
0.06
setTimeout
0.06
koneč
0.06
Joe
0.06
resistant
0.06
Activations Density 0.020%