Explanations
This neuron primarily detects occurrences of the substring “access” in tokens.
Negative Logits
Juliet       -0.08
glm          -0.07
диамет       -0.07
disastr      -0.07
seventeen    -0.07
hurricane    -0.07
Parade       -0.07
Ron          -0.07
Leonard      -0.07
27           -0.07
Positive Logits
access       0.16
Access       0.15
Access       0.12
access       0.11
ACCESS       0.09
accessing    0.09
_access      0.09
ACCESS       0.09
-access      0.09
.Access      0.08
Activations Density 0.042%
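
As a rough sanity check on the explanation, the sketch below tests which of the listed logit tokens contain the substring "access", ignoring case. It is only a string check on the tokens shown on this page, not a measurement of the neuron itself. The helper name `contains_access` is hypothetical, and any leading whitespace that distinguished otherwise identical tokens is assumed to have been lost in the listing.

```python
# Tokens copied from the logit lists on this page. Repeated entries are kept
# as listed; they may originally have differed by leading whitespace.
positive_tokens = [
    "access", "Access", "Access", "access", "ACCESS",
    "accessing", "_access", "ACCESS", "-access", ".Access",
]
negative_tokens = [
    "Juliet", "glm", "диамет", "disastr", "seventeen",
    "hurricane", "Parade", "Ron", "Leonard", "27",
]


def contains_access(token: str) -> bool:
    """Return True if the token contains 'access', ignoring case (hypothetical helper)."""
    return "access" in token.casefold()


positive_hits = [t for t in positive_tokens if contains_access(t)]
negative_hits = [t for t in negative_tokens if contains_access(t)]

print(f"positive tokens containing 'access': {len(positive_hits)} of {len(positive_tokens)}")
print(f"negative tokens containing 'access': {len(negative_hits)} of {len(negative_tokens)}")
```

A case-insensitive match is used because the positive list mixes lowercase, capitalized, and uppercase variants.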