Explanations
content related to personal reflection and introspection
Neuron Alignment
Index   Value   % of L₁
332     +0.11   0.3%
1372    +0.09   0.3%
1264    +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1372    +0.11      0.05
332     +0.09      0.04
1515    +0.09      0.02
Negative Logits
reluct     -1.39
emphat     -1.35
accla      -1.27
shenan     -1.26
disagre    -1.23
milf       -1.20
indestru   -1.19
depic      -1.16
strick     -1.15
maneu      -1.14
Positive Logits
thinking   0.78
thoughts   0.73
thinking   0.72
thought    0.67
💭         0.66
Think      0.66
Thinking   0.65
think      0.65
Think      0.65
Thinking   0.63
Activations Density 0.287%