INDEX
Explanations
Wikipedia articles
specific geographic locations or infrastructure terms.
The neuron selectively activates on the first word of each article, i.e. the document’s title at the very start.
Negative Logits
avadoc      -0.06
Array       -0.06
abortion    -0.06
flush       -0.06
.tabs       -0.06
kola        -0.06
ereotype    -0.06
anges       -0.06
ifiers      -0.06
=-=-        -0.06
Positive Logits
ınızı       0.07
私          0.07
HAL         0.07
Gemini      0.06
BOOT        0.06
'm          0.06
(percent    0.06
Vertical    0.06
출장        0.06
výstav      0.06
Activations Density: 0.010%