INDEX
Explanations
Instances of text referring to being transparent about mistakes, including grammatical errors and personal errors.
Neuron Alignment
Index   Value   % of L₁
1525    +0.08   0.2%
674     +0.07   0.2%
648     +0.07   0.2%
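The "% of L₁" column appears to express each listed entry's magnitude as a share of the L₁ norm of the feature's weight vector. A minimal sketch under that reading follows. It assumes (none of this is stated by the dashboard) that the vector is the feature's encoder or decoder direction and that entries are ranked by absolute value. `neuron_alignment` is a hypothetical helper, and the toy values are made up.

```python
def neuron_alignment(weights, k=3):
    """Hypothetical helper: top-k entries of a feature's weight vector by
    absolute value, each paired with its share of the vector's L1 norm."""
    l1 = sum(abs(w) for w in weights)
    # Rank neuron indices by |weight|; sorted() is stable, so ties keep index order.
    top = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)[:k]
    return [(i, weights[i], abs(weights[i]) / l1) for i in top]


# Toy 4-dimensional vector (made-up values, not the dashboard's data).
for index, value, share in neuron_alignment([0.5, -0.25, 0.25, 0.0], k=2):
    print(f"{index}  {value:+.2f}  {share:.1%}")
# 0  +0.50  50.0%
# 1  -0.25  25.0%
```

Because the share is computed against the whole vector's L₁ norm, a value such as 0.2% indicates that the listed neurons each carry only a tiny fraction of the direction, with no single neuron dominating.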
Correlated Neurons
Index   Pearson Corr.   Cos Sim.
1415    +0.08           0.02
1525    +0.07           0.03
1122    +0.07           0.03
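The Correlated Neurons table reports a correlation and a cosine similarity that need not agree. The following minimal sketch shows one way they can differ. It assumes that the correlation is Pearson's and that both columns compare the feature's activations with a neuron's activations over the same tokens; the dashboard does not state which vectors "Cos Sim." compares. Pearson correlation subtracts each series' mean before comparing, while cosine similarity does not. `pearson` and `cosine` are illustrative helpers, not a library API, and the activations are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance of the mean-centred series over the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cosine(xs, ys):
    """Cosine similarity: the same kind of ratio, but without subtracting the means."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / (math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys)))


# Made-up per-token activations: a sparse feature and an always-positive neuron.
feature = [0.0, 0.0, 1.5, 0.0]
neuron = [2.0, 2.1, 2.0, 1.9]
print(pearson(feature, neuron), cosine(feature, neuron))
```

In this toy case the neuron's fluctuations are unrelated to the feature, so the Pearson correlation is essentially zero. The cosine similarity is still about 0.5 because the neuron is positive on every token, including the one where the feature fires.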
Negative Logits
Majest         -0.74
vnt            -0.70
addirittura    -0.66
impra          -0.66
Juf            -0.65
fta            -0.65
Augu           -0.64
thut           -0.64
délib          -0.63
Bibl           -0.62
Positive Logits
<bos>               0.80
inevitable          0.65
inevitably          0.65
sometime            0.60
occasional          0.59
occasionally        0.56
eventually          0.56
SneakyThrows        0.54
CodedInputStream    0.53
somewhere           0.50
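Logit lists like the two above are commonly read as the feature's direct effect on next-token prediction. Under that reading, the decoder direction is projected through the unembedding matrix, which gives one score per vocabulary token, and the lowest and highest scores are listed. The following minimal sketch works under that assumption and ignores any final normalisation layer the model may apply. `logit_effects` is a hypothetical helper, and the two-dimensional residual stream and three-token vocabulary are made up.

```python
def logit_effects(decoder_dir, unembed, vocab, k=3):
    """Hypothetical helper: score each token by decoder_dir @ unembed and
    return the k most negative and k most positive (token, score) pairs.

    unembed has one row per residual-stream dimension and one column per token.
    """
    scores = [
        sum(decoder_dir[d] * unembed[d][t] for d in range(len(decoder_dir)))
        for t in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]          # lowest scores first
    positive = ranked[::-1][:k]    # highest scores first
    return negative, positive


decoder_dir = [1.0, 0.5]
unembed = [[0.4, -0.6, 0.9],
           [0.2, 0.2, -0.2]]
neg, pos = logit_effects(decoder_dir, unembed, ["a", "b", "c"], k=1)
print(neg, pos)
```

Here the scores are 0.5, -0.5, and 0.8 for "a", "b", and "c", so "b" is the single most negative token and "c" the most positive.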
Activation Density: 0.367%
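Activation density is typically the fraction of tokens on which the feature fires, meaning its activation is above zero. On that reading, 0.367% corresponds to roughly 3.7 tokens in every 1,000. The following minimal sketch assumes a strict threshold of zero. `activation_density` is a hypothetical helper, and the activations are made up.

```python
def activation_density(acts, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds the threshold."""
    return sum(1 for a in acts if a > threshold) / len(acts)


# Made-up activations over eight tokens: the feature fires on two of them.
acts = [0.0, 0.0, 1.2, 0.0, 0.0, 0.4, 0.0, 0.0]
print(f"{activation_density(acts):.3%}")  # 25.000%
```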