Explanations
out words related to researching or exploring a topic in depth
Neuron Alignment
Index   Value   % of L₁
674     +0.16   0.5%
1068    +0.15   0.5%
814     +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1068    +0.16      0.03
1056    +0.15      0.02
814     +0.11      0.01
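The two columns of the Correlated Neurons table can be illustrated with a minimal sketch, assuming "P. Corr." abbreviates the Pearson correlation and "Cos Sim." the cosine similarity. The page does not say which vectors the dashboard compares (activations, weights, or something else), so the inputs below are hypothetical toy values and the function names are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance numerator and the two standard-deviation terms (unnormalised).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy data, not taken from the dashboard.
feature_acts = [0.0, 0.2, 0.9, 0.1]
neuron_acts = [0.1, 0.3, 0.7, 0.0]
print(pearson(feature_acts, neuron_acts))
print(cosine(feature_acts, neuron_acts))
```

The Pearson correlation subtracts the mean before comparing, while cosine similarity does not, which is one reason the two columns can differ for the same neuron.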
Negative Logits
emphat      -1.25
fte         -1.10
effe        -1.08
wien        -1.08
fta         -1.06
intermitt   -1.04
affor       -1.03
reluct      -1.03
perfet      -1.03
accla       -1.01
Positive Logits
learn       0.85
<bos>       0.77
learn       0.75
Learn       0.70
Learn       0.68
learns      0.68
learned     0.66
how         0.66
discover    0.66
out         0.66
Activations Density 0.055%