INDEX
Explanations
words related to honesty and sincerity
Neuron Alignment
Index    Value    % of L₁
1363     +0.14    0.5%
889      +0.10    0.3%
896      +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1363     +0.14       0.03
368      +0.10       0.02
1598     +0.10       0.02
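The two similarity columns above measure different things, and a small sketch may help when reading them. It assumes "P. Corr." is a Pearson correlation and "Cos Sim." is a cosine similarity; the dashboard does not say which vectors each is computed over, so the inputs below are purely illustrative, and the helper names `pearson` and `cosine` are hypothetical rather than part of any tool.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance of the mean-centered values,
    divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def cosine(xs, ys):
    """Cosine similarity: dot product divided by the product of the
    Euclidean norms, with no mean-centering."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

# Illustrative vectors only (not taken from the dashboard).
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]   # same shape as a, shifted by a constant offset

print(pearson(a, b))     # the offset is removed by centering
print(cosine(a, b))      # the offset changes the angle between the vectors
```

Because Pearson correlation centers each vector first and cosine similarity does not, the two numbers can differ substantially for the same pair, which is one reason a neuron can show a correlation of +0.14 alongside a cosine similarity of only 0.03.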
Negative Logits
desir      -1.27
increa     -1.25
disagre    -1.22
thut       -1.22
affor      -1.22
reluct     -1.21
accla      -1.21
?...       -1.18
emphat     -1.17
effe       -1.17
Positive Logits
honest      1.17
honesty     1.05
Honest      0.97
honest      0.94
Honest      0.92
honestly    0.70
<bos>       0.66
truth       0.65
Honesty     0.58
ehrlich     0.57
Activations Density 0.052%