Explanations
The neuron fires on any occurrence of the substring “diff” (in any context or casing).
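The explanation above amounts to a case-insensitive substring test. As a rough sketch only, it can be written as a plain string predicate. This is a stand-in for the written explanation, not the neuron's actual computation, and the function name contains_diff is hypothetical. The token strings from the logit lists on this page are used here only as convenient sample inputs.

```python
def contains_diff(text: str) -> bool:
    """Return True if `text` contains the substring 'diff', ignoring case.

    Hypothetical helper: approximates the written explanation, not the
    neuron's real activation function.
    """
    return "diff" in text.casefold()


# Sample strings copied from the logit lists on this page.
positive_tokens = ["diff", "DIFF", "_diff", "Diff", "(diff", "diffs", "diffuse"]
negative_tokens = ["Kral", "onal", "Monument", "utut", "Pon",
                   "вну", "man", "зан", "ju", "Banner"]

matches_positive = [contains_diff(t) for t in positive_tokens]
matches_negative = [contains_diff(t) for t in negative_tokens]
```

A predicate like this is binary, whereas a real activation is graded, so it can only indicate where the explanation predicts the neuron should be active, not how strongly.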
Negative Logits
Token     Logit
Kral      -0.07
onal      -0.07
Monument  -0.07
utut      -0.07
Pon       -0.07
вну       -0.07
man       -0.07
зан       -0.07
ju        -0.07
Banner    -0.07
Positive Logits
Token     Logit
diff      0.13
diff      0.12
DIFF      0.11
_diff     0.10
Diff      0.10
DIFF      0.09
Diff      0.09
(diff     0.09
diffs     0.08
diffuse   0.08
Activations Density: 0.009%