INDEX
Explanations
positive descriptions of actions or qualities
Neuron Alignment
Index    Value    % of L₁
283      +0.10    0.4%
50       +0.08    0.3%
872      +0.07    0.3%
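
The "% of L₁" column is not defined in this listing. One plausible reading, assumed here and not confirmed by the source, is that it gives each neuron's share of the L₁ norm of the feature's weight vector, |w_i| / Σ_j |w_j|. If that reading is right, a value of 0.10 at 0.4% implies an L₁ norm of roughly 25, allowing for rounding. The function name `l1_share` and the toy vector below are hypothetical, chosen only for illustration.

```python
def l1_share(weights, index):
    """Return the fraction of the vector's L1 norm carried by weights[index].

    Assumption: "% of L1" means |w_i| / sum_j |w_j|. The source does not
    define the column, so treat this as one possible interpretation.
    """
    total = sum(abs(w) for w in weights)
    if total == 0:
        return 0.0  # an all-zero vector has no meaningful share
    return abs(weights[index]) / total


# Hypothetical toy vector, NOT the feature's actual weights.
toy = [0.10, -0.30, 0.05, 0.55]
share = l1_share(toy, 0)
print(f"{share:.1%}")
```

Taking absolute values means a large negative weight counts toward the norm just as a large positive one does, which matches how an L₁ norm is defined.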
Correlated Neurons
Index    P. Corr.    Cos Sim.
1826     +0.10       0.06
1703     +0.08       0.04
381      +0.07       0.01
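
The two columns above measure different things, which is why they need not agree. As a minimal sketch, assuming "P. Corr." is a Pearson correlation and "Cos Sim." is a cosine similarity (the listing does not say which vectors each is computed over), the difference can be shown on toy data. The helper names `pearson` and `cosine` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance of mean-centered values, normalized."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom else 0.0  # 0.0 when either input is constant


def cosine(u, v):
    """Cosine similarity: dot product of the raw (uncentered) vectors, normalized."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0  # 0.0 when either vector is all zeros


# Hypothetical toy data: b is a shifted by a constant offset.
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]
print(pearson(a, b), cosine(a, b))
```

Pearson correlation subtracts the mean first, so a constant offset leaves it unchanged, while cosine similarity works on the raw vectors and is affected by the offset.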
Negative Logits
Token        Logit
<bos>        -3.45
intersper    -2.18
encomp       -1.99
shenan       -1.76
inconce      -1.75
reluct       -1.71
unspeak      -1.71
hairc        -1.70
indestru     -1.70
impra        -1.65
Positive Logits
Token     Logit
asfal     1.05
torba     0.99
utop      0.99
tyn       0.99
ortop     0.99
sement    0.97
ananas    0.96
sonda     0.95
balon     0.95
benzin    0.94
Activations Density: 2.013%
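
Activation density is commonly understood as the fraction of sampled tokens on which a feature fires. The sketch below assumes that definition with a firing threshold of zero; the source states only the final percentage, so the threshold, the function name `activation_density`, and the sample values are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: a token counts as "active" when its activation exceeds
    the threshold (0.0 by default). The source does not specify this.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations, NOT data from this feature.
sample = [0.0, 0.0, 1.5, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this definition, a density of about 2% would mean the feature is active on roughly one token in fifty.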