Explanations
responses indicating agreement or affirmation
Neuron Alignment
Index    Value    % of L₁
369      +0.16    0.9%
326      +0.15    0.9%
198      +0.11    0.6%
Correlated Neurons
Index    P. Corr.    Cos Sim.
390      +0.16       0.05
369      +0.15       0.02
339      +0.11       0.03
Negative Logits
¼      -2.31
ĨĴ     -1.95
ĭ      -1.94
Ļª     -1.93
º      -1.87
Ĺ      -1.84
©      -1.81
Ļ      -1.81
ľĵ     -1.79
²      -1.78
Positive Logits
Answer       1.87
Answer       1.76
ologia       1.63
worry        1.59
Question     1.49
:**          1.47
answer       1.44
:_           1.38
objection    1.36
oxford       1.35
Activations Density 1.152%