INDEX
Explanations
phrases starting with "that" where the following words are qualitatively describing something
Neuron Alignment
Index   Value   % of L₁
674     +0.19   0.7%
161     +0.19   0.7%
1350    +0.14   0.5%
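
A minimal sketch of how a "% of L₁" column could be derived, assuming it reports each neuron's absolute weight as a share of the L₁ norm of the feature's full weight vector. The dashboard does not state this definition, so treat it as an assumption. The function name `l1_share` and the toy vector are hypothetical and are not the feature's real weights.

```python
def l1_share(weights):
    """Return each weight's absolute value as a fraction of the vector's L1 norm.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        # An all-zero vector has no meaningful shares.
        return [0.0 for _ in weights]
    return [abs(w) / l1_norm for w in weights]


# Toy vector for illustration only (hypothetical values).
toy_weights = [0.5, -0.25, 0.25]
for index, share in enumerate(l1_share(toy_weights)):
    print(f"neuron {index}: {share:.1%} of L1")
```

Under this reading, small percentages such as those in the table would indicate that the feature's weight is spread across many neurons rather than concentrated in a few.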
Correlated Neurons
Index   P. Corr.   Cos Sim.
161     +0.19      0.11
1023    +0.19      0.09
1350    +0.14      0.08
Negative Logits
lts       -1.11
lein      -1.08
aen       -1.05
affor     -1.05
Confu     -1.04
inappro   -1.03
unden     -1.03
Middles   -1.03
parch     -1.02
walter    -1.01
Positive Logits
<bos>     0.97
that      0.85
THAT      0.79
that      0.77
THAT      0.70
That      0.69
That      0.65
dat       0.62
que       0.59
Eso       0.57
Activations Density 0.342%
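
A minimal sketch of how an activation density figure like the one above could be computed, assuming it is the fraction of sampled tokens on which the feature's activation is greater than zero. The dashboard does not define the metric, so this definition, the function name `activation_density`, and the toy activations are all assumptions made for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: a feature "fires" on a token when its activation is > 0.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy per-token activations (hypothetical values): one firing token out of four.
toy_activations = [0.0, 0.0, 0.3, 0.0]
print(f"Activations Density {activation_density(toy_activations):.3%}")
```

Under this assumption, a density of 0.342% would mean the feature fires on roughly 3 to 4 tokens per thousand.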