INDEX
Explanations
personal pronouns and possessive determiners
Neuron Alignment
Index    Value    % of L₁
1385     +0.11    0.3%
478      +0.11    0.3%
942      +0.09    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
303      +0.11       0.04
397      +0.11       0.03
478      +0.09       0.04
Negative Logits
Token    Logit
maneu    -1.34
accla    -1.32
desir    -1.29
laun     -1.28
depic    -1.28
effe     -1.28
secon    -1.26
wien     -1.25
fuf      -1.25
fortn    -1.24
Positive Logits
Token         Logit
<bos>         0.97
teachings     0.64
kindness      0.63
latest        0.61
generosity    0.59
guidance      0.59
s             0.58
words         0.57
advice        0.56
approval      0.56
Activations Density: 0.270%