INDEX
Explanations
This neuron detects self-descriptive phrases in which the model states that it operates “based on” its training data.
Negative Logits
Token     Logit
Dealer    -0.06
↵↵↵       -0.06
ать       -0.06
"When     -0.06
SHOW      -0.06
Goals     -0.06
Two       -0.06
boots     -0.06
録        -0.06
12        -0.06
Positive Logits
Token        Logit
locksmith    0.07
(nr          0.07
div          0.07
sov          0.07
.ins         0.07
<QString     0.06
clf          0.06
Sob          0.06
LIB          0.06
.customer    0.06
Activations Density 0.028%
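A density figure like the one above is usually read as the fraction of sampled token positions on which the neuron fires. The page does not state its exact threshold or sample, so the following is only a minimal sketch under that assumption; the function name `activation_density`, the zero threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    neuron is active; the threshold of 0.0 is illustrative, not taken
    from the page.
    """
    total = len(activations)
    if total == 0:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / total


# Hypothetical sample: 10,000 token positions, 3 of which activate.
sample = [0.0] * 10_000
for position, value in [(5, 0.8), (42, 1.3), (977, 0.2)]:
    sample[position] = value

density = activation_density(sample)
print(f"{density:.3%}")
```

A value as low as 0.028% would correspond to roughly 28 active positions per 100,000 tokens under this reading, which fits a neuron tied to one narrow phrase pattern.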