Explanations
differences
The neuron is sharply tuned to detect the word “differences” (and very similar comparative nouns like “disparities”) in the text.
Negative Logits
Token         Logit
Newton        -0.08
headed        -0.08
Studio        -0.07
oul           -0.07
ANT           -0.07
novels        -0.06
.cljs         -0.06
squat         -0.06
.ViewModel    -0.06
Weston        -0.06
Positive Logits
Token          Logit
differences    0.14
Differences    0.13
difference     0.12
Difference     0.11
differ         0.11
difference     0.11
ifferences     0.09
_difference    0.09
differed       0.08
Difference     0.08
Activations Density 0.029%
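A density figure like this can be read as the share of sampled tokens on which the neuron fires at all: 0.029% is about 29 tokens in every 100,000. The sketch below shows that arithmetic, assuming density means the fraction of tokens with an activation above zero. That definition, the function name `activation_density`, and the sample data are illustrative assumptions, not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the share of tokens on which the neuron
    fires at all; the dashboard's exact definition is not stated.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Made-up sample: 29 active tokens out of 100,000.
sample = [1.5] * 29 + [0.0] * 99_971
density = activation_density(sample)
print(f"{density:.3f}%")
```

A value this low is in line with the explanation above: the neuron would be silent on nearly all tokens and active only on a narrow set of them.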