Explanations
pronouns referring to the reader or audience
Neuron Alignment
Index   Value   % of L₁
50      +0.13   0.5%
1978    +0.05   0.2%
204     +0.05   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
100     +0.13      0.03
1566    +0.05      0.03
814     +0.05      0.02
Negative Logits
Token               Logit
<bos>               -1.56
ⓧ                   -1.33
<?                  -1.07
(no visible text)   -1.06
/**                 -0.92
/***                -0.84
/*                  -0.81
<?                  -0.74
#                   -0.70
/**                 -0.66
Positive Logits
Token         Logit
maroc         1.09
meis          1.08
maneu         0.96
disreg        0.95
désol         0.95
endom         0.92
lamborghini   0.91
impra         0.90
italia        0.90
ibiza         0.90
Activation Density: 0.107%