Explanations
category
The neuron fires on Wikipedia‐style category tags (i.e. lines beginning with “Category:”).
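The explanation above can be made concrete with a minimal sketch that picks out the kind of lines described. This is only an illustration of the text pattern, not the neuron's actual mechanism. The helper name `category_lines` is hypothetical, and accepting an optional leading `[[` (wiki link syntax) is an assumption that goes beyond the stated explanation.

```python
import re

# Assumption: a "category tag" is a line that starts with "Category:",
# optionally preceded by whitespace and wiki link brackets "[[".
CATEGORY_LINE = re.compile(r"\s*(?:\[\[)?Category:")


def category_lines(text):
    """Return the lines of `text` that look like Wikipedia-style category tags."""
    return [line for line in text.splitlines() if CATEGORY_LINE.match(line)]


sample = (
    "Paris is the capital of France.\n"
    "Category:Capitals in Europe\n"
    "[[Category:Cities in France]]\n"
    "See the category list."
)
matches = category_lines(sample)
```

Here `matches` holds the two tag lines from `sample`; the ordinary sentences, including the one that merely mentions the word "category", are skipped.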
Negative Logits
Token       Logit
supports    -0.07
stashop     -0.07
redirects   -0.06
promote     -0.06
freder      -0.06
changed     -0.06
derive      -0.06
realized    -0.06
드리        -0.06
volent      -0.06
Positive Logits
Token          Logit
Id             0.07
nicotine       0.07
itten          0.07
_ind           0.06
Issues         0.06
ousing         0.06
Jal            0.06
Habitat        0.06
aos            0.06
longitudinal   0.06
Activations Density: 0.008%