INDEX
Explanations
possessive pronouns and terms related to attribution
Neuron Alignment
Index   Value   % of L₁
50      +0.32   1.3%
1950    +0.09   0.3%
1150    +0.08   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1878    +0.32      0.03
1153    +0.09      0.02
1150    +0.08      0.01
Negative Logits
<bos>       -2.96
have        -0.62
and         -0.61
}{          -0.61
,           -0.61
꿔          -0.60
#![         -0.59
옮          -0.59
become      -0.58
protected   -0.58
Positive Logits
thut        1.68
fta         1.60
Minang      1.57
stockholm   1.53
hcm         1.48
Juf         1.48
bandung     1.45
aen         1.45
desir       1.44
fte         1.44
Activations Density 0.126%