INDEX
Explanations
words related to lists or items in a list
Neuron Alignment
Index   Value   % of L₁
1047    +0.17   0.7%
406     +0.13   0.5%
555     +0.13   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1047    +0.17      0.03
406     +0.13      0.01
1616    +0.13      0.02
Negative Logits
reluct      -0.67
Godt        -0.66
horrend     -0.66
unspeak     -0.65
spartan     -0.64
cuck        -0.62
spind       -0.60
enthusi     -0.59
celtic      -0.59
apprehen    -0.59
Positive Logits
•         0.87
•         0.82
.•        0.71
)•        0.69
••        0.68
("")]     0.64
°•        0.61
·         0.58
~•        0.58
(::       0.57
Activations Density: 0.108%