INDEX
Explanations
explanations or reasoning in a text
Neuron Alignment
Index   Value   % of L₁
1150    +0.09   0.3%
946     +0.09   0.3%
1438    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
946     +0.09      0.05
1533    +0.09      0.02
1056    +0.08      0.05
Negative Logits
Lma     -1.24
Ikr     -1.23
FTFY    -1.21
Lmfao   -1.20
        -1.08
uefa    -1.06
Noice   -1.03
sappi   -0.99
<?      -0.97
Yess    -0.93
Positive Logits
they    0.71
that    0.67
he      0.62
it      0.62
she     0.60
if      0.60
there   0.59
we      0.57
you     0.57
told    0.57
Activations Density 0.309%