INDEX
Explanations
phrases related to proving a point or showcasing accomplishments
Neuron Alignment
Index    Value    % of L₁
272      +0.08    0.2%
674      +0.07    0.2%
1253     +0.07    0.2%
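
The "% of L₁" column is not defined on this page. One plausible reading, assumed here rather than confirmed by the source, is that it gives each neuron's absolute weight as a share of the L₁ norm (sum of absolute values) of the feature's full weight vector. A minimal sketch under that assumption, using a hypothetical helper `l1_share` and toy numbers that are not taken from the dashboard:

```python
def l1_share(weights):
    """Return each weight's share of the vector's L1 norm, in percent.

    Assumption: "% of L1" means 100 * |w_i| / sum_j |w_j|.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [100.0 * abs(w) / l1_norm for w in weights]


# Toy weight vector (hypothetical values, chosen only for illustration).
toy_weights = [0.5, -0.25, 0.25]
shares = l1_share(toy_weights)
print(shares)
```

If this reading is right, a small share for the top-ranked neurons would mean the feature's weight is spread across many neurons rather than concentrated in a few.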
Correlated Neurons
Index    P. Corr.    Cos Sim.
310      +0.08       0.03
851      +0.07       0.03
584      +0.07       0.02
Negative Logits
teras       -0.68
はじめに    -0.62
ert         -0.59
NKC         -0.58
kram        -0.57
Okt         -0.56
kaka        -0.55
kano        -0.55
miy         -0.54
lele        -0.54
Positive Logits
prove       1.07
disprove    1.03
proves      0.93
proving     0.91
proved      0.86
Prove       0.80
proof       0.80
prove       0.76
defy        0.76
defied      0.75
Activations Density 0.375%
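
The page does not say how the density figure is computed. A common convention, assumed here, is the fraction of sampled tokens on which the feature's activation is above zero. The sketch below illustrates that convention with a hypothetical helper `activation_density` and made-up activations; it is not the dashboard's own code:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" counts a token as active when its
    activation is strictly greater than the threshold (default 0).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy activations over 8 tokens (hypothetical, for illustration only).
toy_acts = [0.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.4, 0.0]
density = activation_density(toy_acts)
print(f"{density:.3%}")
```

Under this convention, a density of 0.375% would correspond to the feature firing on roughly 3 or 4 tokens per thousand.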