Explanations
categories
The neuron detects tokens in Wikipedia category listings, especially the “Category:” lines at the ends of articles.
Negative Logits
Token         Logit
Hastings      -0.07
.Month        -0.06
selects       -0.06
355           -0.06
Flexible      -0.06
earned        -0.06
safety        -0.06
maintenance   -0.06
$q            -0.06
далі          -0.06
Positive Logits
Token         Logit
veget         0.07
の方          0.07
Af            0.07
_encoder      0.06
-*-           0.06
zurück        0.06
けど          0.06
assignment    0.06
haircut       0.06
Somebody      0.06
Activations Density: 0.018%