INDEX
Explanations
proper names related to a particular novel being analyzed
Neuron Alignment
Index   Value   % of L₁
874     +0.08   0.3%
1323    +0.07   0.2%
122     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1343    +0.08      0.04
1120    +0.07      0.04
1194    +0.07      0.03
Negative Logits
Token               Logit
public              -0.81
[token not shown]   -0.80
have                -0.79
engage              -0.79
raise               -0.79
<bos>               -0.78
can                 -0.78
[token not shown]   -0.78
[token not shown]   -0.77
.                   -0.77
Positive Logits
Token     Logit
fta       2.29
secon     2.22
strick    2.19
Len       2.19
dispen    2.19
aen       2.19
affor     2.19
effe      2.17
fuf       2.16
squa      2.15
Activations Density: 0.152%