INDEX
Explanations
New Auto-Interp: Instances where the word "explain" is used.
Neuron Alignment
Index   Value   % of L₁
376     +0.21   1.2%
115     +0.15   0.9%
372     +0.12   0.7%
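One plausible reading of the "% of L₁" column is each neuron weight's share of the L₁ norm of the feature's decoder vector over neurons. This is an assumption, not something the dashboard states, but it fits the numbers: 0.21 / 1.2%, 0.15 / 0.9% and 0.12 / 0.7% all imply an L₁ norm of roughly 17. A minimal sketch, where `decoder_row` is a hypothetical list of the feature's per-neuron weights and ranking by absolute value is also an assumption:

```python
def neuron_alignment(decoder_row, k=3):
    """Return (index, value, pct_of_l1) for the k largest-magnitude weights.

    Assumption: pct_of_l1 = |value| / L1 norm of decoder_row, as a percentage.
    decoder_row is a hypothetical stand-in for the feature's decoder weights.
    """
    l1 = sum(abs(w) for w in decoder_row)
    if l1 == 0:
        return []
    # Rank neuron indices by weight magnitude, largest first (stable on ties).
    ranked = sorted(range(len(decoder_row)),
                    key=lambda i: abs(decoder_row[i]), reverse=True)
    return [(i, decoder_row[i], 100.0 * abs(decoder_row[i]) / l1)
            for i in ranked[:k]]


print(neuron_alignment([0.125, -0.5, 0.375], k=2))
```

With these toy weights the L₁ norm is 1.0, so the two largest weights account for 50% and 37.5% of it.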
Correlated Neurons
Index   Pearson Corr.   Cos. Sim.
167     +0.21           0.02
372     +0.15           0.02
345     +0.12           0.02
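The two columns measure different things, which is why they can disagree so sharply. The dashboard does not say exactly which vectors each column compares, so the sketch below only illustrates the two measures on generic equal-length vectors (for example, a feature's and a neuron's activations over the same tokens, which is an assumption). Pearson correlation is the cosine similarity of the mean-centered vectors:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity after subtracting each mean."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])


x, y = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]
print(cosine_similarity(x, y), pearson_correlation(x, y))
```

For these toy vectors the raw cosine similarity is about 0.71 while the Pearson correlation is -1, showing that the two measures need not agree in size or sign.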
Negative Logits
ubicin    -1.75
brow      -1.75
.]{}      -1.72
\]]{}     -1.51
\])]{}    -1.47
)];       -1.46
)]{}      -1.46
s         -1.43
]{}.      -1.43
Bg        -1.40
Positive Logits
why       1.89
how       1.55
ķ         1.54
tera      1.52
ably      1.51
famine    1.49
ĸ         1.44
Ł         1.43
partum    1.41
error     1.35
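Logit lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding and reading off the most negative and most positive entries. Whether this dashboard does exactly that (and how it treats the final layer norm) is an assumption. The sketch takes a hypothetical `decoder_dir` already expressed in the residual stream, a hypothetical `unembed` matrix of shape d × V, and ignores layer norm:

```python
def top_logits(decoder_dir, unembed, vocab, k=10):
    """Return (positive, negative) lists of (token, weight) pairs.

    Assumptions: decoder_dir lives in the residual stream (length d),
    unembed is a d x V nested list, and any final layer norm is ignored.
    """
    d, V = len(decoder_dir), len(vocab)
    # Logit weight for each token t: dot product of decoder_dir with column t.
    weights = [sum(decoder_dir[i] * unembed[i][t] for i in range(d))
               for t in range(V)]
    order = sorted(range(V), key=lambda t: weights[t])  # ascending
    negative = [(vocab[t], weights[t]) for t in order[:k]]
    positive = [(vocab[t], weights[t]) for t in reversed(order[-k:])]
    return positive, negative


pos, neg = top_logits([1.0, -1.0],
                      [[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
                      ["a", "b", "c"], k=1)
print(pos, neg)
```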
Activation Density: 1.150%
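Activation density is usually read as the share of tokens on which the feature is active, so 1.150% corresponds to roughly 115 of every 10,000 tokens. Treating "active" as "activation strictly above zero" is an assumption; the hypothetical `threshold` parameter makes that choice explicit:

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose feature activation exceeds threshold.

    Assumption: a feature counts as active when its activation > threshold.
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


print(activation_density([0.0, 0.3, 0.0, 0.0]))
```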