Explanations
quotes or dialogue markers in the text
Neuron Alignment
Index    Value    % of L₁
279      +0.12    0.6%
287      +0.11    0.6%
76       +0.11    0.6%
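The "% of L₁" column can plausibly be read as each neuron's share of the L₁ norm of the feature's direction in neuron space. The page does not define it, so this reading is an assumption. A minimal sketch under that assumption follows; `l1_share` is a hypothetical helper and the vector `w` is made-up illustration data, not this feature's weights.

```python
def l1_share(weights):
    """Return each entry's absolute value as a fraction of the vector's L1 norm.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|. The dashboard itself
    does not state the formula.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [abs(w) / l1 for w in weights]


# Illustrative vector only (not taken from the dashboard).
w = [0.5, -0.25, 0.25]
shares = l1_share(w)
```

Under this reading, a value of +0.12 shown at 0.6% would imply an L₁ norm of roughly 20 (0.12 / 0.006), although the displayed figures are rounded.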
Correlated Neurons
Index    P. Corr.    Cos Sim.
188      +0.12       0.17
154      +0.11       0.14
486      +0.11       0.14
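The two columns report different similarity measures: "P. Corr." is presumably Pearson correlation and "Cos Sim." cosine similarity. The page does not say which vectors each is computed over, so the sketch below only illustrates how the two measures relate: Pearson correlation is the cosine similarity of the mean-centered vectors. The function names and sample data are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


# Illustrative data only: b decreases as a increases.
a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
cos_ab = cosine(a, b)    # positive, since all entries are positive
corr_ab = pearson(a, b)  # negative, since the trends are opposite
```

The example shows why the two numbers can differ for the same pair: cosine similarity is sensitive to a shared offset, while Pearson correlation removes it first.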
Negative Logits
³        -3.45
½        -3.27
ĥ½       -3.20
Ĥ        -3.19
Ħ        -3.14
↵        -3.08
↵        -3.08
↵        -3.08
↵ ↵      -3.08
↵        -3.08
Positive Logits
itself   1.62
miser    1.43
gger     1.37
its      1.35
taire    1.35
me       1.33
ark      1.32
founded  1.29
stool    1.28
blic     1.28
Activations Density 0.411%