Explanations
The neuron fires on the word “Sign,” especially when it appears as the first token of section headings or titles.
Negative Logits
เช          -0.06
Economics   -0.06
upd         -0.06
[:]         -0.06
-release    -0.06
438         -0.06
clusters    -0.06
WAL         -0.06
Prel        -0.06
_intro      -0.06
Positive Logits
sign        0.10
Sign        0.09
Sign        0.08
\',         0.07
ΗΜΑ         0.06
SIGN        0.06
...,        0.06
_rewrite    0.06
dig         0.06
YG          0.06
Activations Density 0.002%