Explanations
verbs related to actions or behaviors
Neuron Alignment
Index   Value   % of L₁
334     +0.09   0.2%
1379    +0.08   0.2%
648     +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1893    +0.09      0.03
1592    +0.08      0.02
870     +0.08      0.03
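The two columns of the Correlated Neurons table can be illustrated with a minimal sketch. It assumes that "P. Corr." is the Pearson correlation between two activation series and that "Cos Sim." is the cosine similarity between two vectors; the dashboard does not say exactly which vectors it compares, so the function names and inputs below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)
```

Under these assumptions the two measures differ only in mean-centering, which is one reason a pair can show a small positive correlation while the cosine similarity stays near zero.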
Negative Logits
accla     -1.29
reluct    -1.23
fuf       -1.22
embra     -1.20
inev      -1.19
purcha    -1.19
strick    -1.16
shenan    -1.16
encomp    -1.15
depic     -1.13
Positive Logits
seem           0.57
seems          0.54
sometimes      0.54
seemed         0.54
acknowledge    0.54
וגם            0.52
hancer         0.51
(!             0.51
nawet          0.51
hitheatre      0.51
Activations Density: 0.269%
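As a rough illustration of what an activation density figure measures, the sketch below assumes it is the fraction of sampled tokens on which the feature's activation exceeds a threshold (zero by default). The function name and the threshold parameter are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)
```

With this definition, a density of 0.269% would mean the feature fires on roughly 27 of every 10,000 sampled tokens.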