Explanations
This neuron detects meta‐instruction language, especially the word “role” and related role-play directives.
Negative Logits
/\              -0.07
.MainActivity   -0.06
activations     -0.06
Comput          -0.06
步              -0.06
nick            -0.06
deren           -0.06
Pok             -0.06
forgive         -0.06
stin            -0.06
Positive Logits
Venezuelan      0.07
Auckland        0.07
Metals          0.06
clearfix        0.06
선거            0.06
thực            0.06
absolute        0.06
categor         0.06
MLA             0.06
ısından         0.06
Activations Density: 0.001%