Explanations
periods or other punctuation marks at the end of sentences
Neuron Alignment
Index   Value   % of L₁
437     +0.13   0.7%
435     +0.12   0.7%
110     +0.12   0.6%
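The "% of L₁" column is not defined on this page. A minimal sketch of one plausible reading, assuming it is the absolute value of one entry divided by the L₁ norm (sum of absolute values) of the whole vector. The function name `l1_share` and the toy vector are hypothetical and are not the data behind the table.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a percentage of the vector's L1 norm.

    Assumption: "% of L1" means one entry's share of the sum of
    absolute values. The page itself does not confirm this definition.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; the share is undefined")
    return 100.0 * abs(weights[index]) / l1_norm


# Hypothetical toy vector, chosen so the arithmetic is easy to check by hand.
toy_weights = [0.5, -0.25, 0.25]
share_first = l1_share(toy_weights, 0)   # 0.5 out of an L1 norm of 1.0
share_second = l1_share(toy_weights, 1)  # the sign is ignored: 0.25 out of 1.0
```

Under this reading, small percentages such as those in the table would mean that no single neuron carries much of the total weight.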
Correlated Neurons
Index   P. Corr.   Cos Sim.
133     +0.13      0.08
258     +0.12      0.07
419     +0.12      0.06
Negative Logits
Token      Logit
elves      -1.75
ality      -1.56
)](        -1.54
chool      -1.53
\]](       -1.52
moderate   -1.48
ffect      -1.45
eling      -1.43
-->        -1.42
woke       -1.41
Positive Logits
Token      Logit
Ń          4.07
¬          3.71
↵          3.46
(blank)    3.46
(blank)    3.46
↵          3.46
↵          3.46
(blank)    3.46
(blank)    3.46
↵          3.46
Activations Density 0.115%
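
The density figure is given without a definition. A minimal sketch, assuming it is the fraction of tokens on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the toy activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature fires at all. The page does not state its definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy activations: the feature fires on 1 of 4 tokens.
toy_activations = [0.0, 0.0, 1.5, 0.0]
density = activation_density(toy_activations)
density_percent = 100.0 * density
```

Under this reading, a density of 0.115% would correspond to the feature firing on roughly one token in every 870.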