Explanations
The neuron activates on instances of the word “named.”
Negative Logits
Token      Logit
frage      -0.07
ительно    -0.07
(con       -0.07
Can        -0.07
go         -0.07
-stop      -0.07
do         -0.07
been       -0.07
دیگر       -0.07
check      -0.07
Positive Logits
Token          Logit
named          0.10
named          0.09
titled         0.09
unauthorized   0.08
Named          0.08
gifted         0.07
ured           0.07
Signed         0.07
knit           0.06
ed             0.06
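Logit lists like these are commonly produced by projecting a neuron's output weights through the model's unembedding matrix and keeping the highest and lowest scoring tokens. The sketch below assumes exactly that (with no final layer norm applied), which this page does not confirm; `logit_effects`, `w_out`, and `W_U` are hypothetical names, and the weights are made up. The two "named" rows above probably differ only by a leading space (for example " named" versus "named") that the page does not display; that is also an assumption.

```python
# Hypothetical sketch: rank vocabulary tokens by a neuron's direct effect on the logits.
# Assumes logit[j] = sum_i w_out[i] * W_U[i][j] (no final layer norm applied).

def logit_effects(w_out, W_U, vocab, k=10):
    """Return the k most positive and k most negative (token, logit) pairs.

    w_out: the neuron's output weight vector, length d_model (hypothetical input)
    W_U:   unembedding matrix as d_model rows of vocab_size floats
    vocab: token strings, one per unembedding column
    """
    d_model = len(w_out)
    if len(W_U) != d_model:
        raise ValueError("W_U must have one row per output dimension")
    # Project the output direction through the unembedding: one score per token.
    logits = [
        sum(w_out[i] * W_U[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    pairs = list(zip(vocab, logits))
    top_positive = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
    top_negative = sorted(pairs, key=lambda p: p[1])[:k]
    return top_positive, top_negative


# Toy example with made-up weights (not the real model's values).
vocab = [" named", "named", " go", " check"]
w_out = [1.0, 0.0]
W_U = [
    [0.10, 0.09, -0.07, -0.06],
    [0.50, 0.50, 0.50, 0.50],
]
pos, neg = logit_effects(w_out, W_U, vocab, k=2)
print(pos)
print(neg)
```

Sorting the full list is fine for a toy vocabulary; for a real vocabulary of tens of thousands of tokens a partial selection (for example `heapq.nlargest`) would avoid the full sort.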
Activation density: 0.017% (the share of tokens on which the neuron activates)
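A density of 0.017% means the neuron fires on roughly 17 of every 100,000 tokens, which fits a neuron that responds to a single word. The sketch below is a minimal illustration that assumes "activates" means an activation strictly above a threshold of 0; the page does not state its threshold, and `activation_density` is a hypothetical helper.

```python
# Hypothetical sketch: activation density as the percentage of token positions
# where the neuron's activation is above a threshold (assumed to be 0 here).

def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly greater than threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# 0.017% corresponds to 17 active positions out of every 100,000 tokens.
acts = [0.0] * 99_983 + [2.5] * 17
print(activation_density(acts))
```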