Explanations
This neuron primarily detects occurrences of the word “filter.”
Negative Logits
Token        Logit
overseeing   -0.07
nen          -0.07
Stan         -0.07
Stanley      -0.07
conceived    -0.07
ad           -0.07
286          -0.07
Nou          -0.07
23           -0.06
Conan        -0.06
Positive Logits
Token        Logit
filter       0.14
Filter       0.13
Filter       0.12
filters      0.11
filter       0.11
FILTER       0.10
FILTER       0.09
filtr        0.09
filtered     0.09
filtered     0.09
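The explanation can be sanity-checked against the positive logit tokens listed above. The following is a minimal sketch, not part of the original page: the helper name `is_filter_variant` and the choice of the stem "filt" as the matching rule are assumptions made for illustration. Token strings and values are copied as listed, including the repeated entries.

```python
# Top positive-logit tokens and values, copied as listed on this page.
positive_logits = [
    ("filter", 0.14),
    ("Filter", 0.13),
    ("Filter", 0.12),
    ("filters", 0.11),
    ("filter", 0.11),
    ("FILTER", 0.10),
    ("FILTER", 0.09),
    ("filtr", 0.09),
    ("filtered", 0.09),
    ("filtered", 0.09),
]


def is_filter_variant(token: str) -> bool:
    """Hypothetical rule: case-insensitive match on the stem 'filt'."""
    return token.strip().lower().startswith("filt")


# Collect the tokens that match the rule and report their share of the list.
matches = [tok for tok, _ in positive_logits if is_filter_variant(tok)]
share = len(matches) / len(positive_logits)
print(f"{len(matches)} of {len(positive_logits)} tokens match ({share:.0%})")
# prints "10 of 10 tokens match (100%)"
```

Under this rule, every listed positive token is a case or inflection variant of "filter", which is consistent with the explanation.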
Activations Density 0.019%
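As a rough illustration of how a density figure like this could be computed, the sketch below assumes that density means the percentage of tokens on which the neuron's activation is greater than zero. That definition, the function name `activation_density`, and the sample data are all assumptions. The sample is synthetic and was sized only so that it reproduces the 0.019% figure.

```python
def activation_density(activations):
    """Percentage of tokens with activation > 0 (assumed definition)."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > 0)
    return 100.0 * fired / len(activations)


# Synthetic sample, not real data: 19 firing tokens out of 100,000.
sample = [0.0] * 99_981 + [1.0] * 19
density = activation_density(sample)
print(f"{density:.3f}%")  # prints "0.019%"
```

A density this low means the neuron is silent on almost all tokens, which fits a detector for a single word.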