INDEX
Explanations
mentions of reading and blog posts
Neuron Alignment
Index   Value   % of L₁
227     +0.10   0.3%
381     +0.10   0.3%
198     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
613     +0.10      0.05
1973    +0.10      0.04
1056    +0.07      0.05
Negative Logits
nawr              -0.61
Abit              -0.55
DeleteBehavior    -0.51
⇨                 -0.51
UIFont            -0.51
Charsets          -0.51
RTGC              -0.51
Paglinawan        -0.50
AfterEach         -0.50
Koordin           -0.50
Positive Logits
disagre           1.35
intersper         1.33
apprehen          1.19
pamph             1.13
unspeak           1.12
unwarran          1.11
maneu             1.07
gaily             1.06
uninten           1.06
ftre              1.05
Activations Density: 0.567%