Explanations
appropriate
This neuron detects the token “appropriate.”
Negative Logits
沉             -0.08
totals        -0.07
256           -0.07
Neuroscience  -0.07
&=            -0.07
985           -0.07
测             -0.07
Sanders       -0.07
tests         -0.07
880           -0.06
Positive Logits
appropriate    0.16
appropriately  0.12
inappropriate  0.11
appropriate    0.11
ighet          0.09
ropriate       0.09
               0.08
APT            0.08
opped          0.08
οπο            0.07
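As a rough illustration of how token lists like the ones above can be produced, the sketch below projects a neuron's output direction through an unembedding matrix and ranks tokens by the resulting logit effect. This is an assumption about the method, not something stated on this page; `top_logit_effects`, the two-dimensional `direction`, and the `unembed` rows are hypothetical toy values chosen only so the ranking resembles the lists above.

```python
def top_logit_effects(direction, unembed, vocab, k=2):
    """Rank tokens by the dot product of a neuron direction with each token's
    unembedding row. Hypothetical helper for illustration only."""
    effects = [sum(d * w for d, w in zip(direction, row)) for row in unembed]
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative effects first
    positive = ranked[::-1][:k]    # most positive effects first
    return positive, negative


# Toy values (assumed, not taken from any real model weights).
vocab = ["appropriate", "appropriately", "inappropriate", "totals", "沉"]
unembed = [
    [0.16, 0.30],
    [0.12, -0.20],
    [0.11, 0.10],
    [-0.07, 0.50],
    [-0.08, -0.40],
]
direction = [1.0, 0.0]  # hypothetical neuron output direction

positive, negative = top_logit_effects(direction, unembed, vocab)
print([tok for tok, _ in positive])
print([tok for tok, _ in negative])
```

With these toy numbers, the tokens pushed up most are "appropriate" and "appropriately", and the tokens pushed down most are "沉" and "totals".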
Activations Density: 0.020%
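One plausible reading of the density figure is the fraction of sampled token positions on which the neuron fires at all. The sketch below computes that fraction under this assumption; `activation_density`, the zero threshold, and the toy activation list are hypothetical and not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.
    Assumed definition, for illustration only."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: 10,000 token positions, 2 of which are active.
toy_activations = [0.0] * 9998 + [3.1, 0.7]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.020%
```

A density this low means the neuron is silent on nearly all tokens, which fits an explanation tied to a single word.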