Explanations
Analyses
This neuron detects user instructions asking the model to analyze, critique, evaluate, or review content.
Negative Logits
Buffer          -0.07
σετε            -0.07
_buffer         -0.07
ूर              -0.07
relax           -0.06
dine            -0.06
oping           -0.06
uvwxyz          -0.06
frustrations    -0.06
}">↵            -0.06

Positive Logits
الم             0.07
действ          0.07
seriously       0.06
düş             0.06
sont            0.06
silicon         0.06
vešker          0.06
placements      0.06
Savaşı          0.06
legitimacy      0.06
Activations Density: 0.041%