Explanations
instances of hypocrisy in various contexts
Neuron Alignment
Index   Value   % of L₁
510     +0.13   0.8%
468     +0.13   0.7%
111     +0.11   0.6%
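One way to read the "% of L₁" column is as each neuron's share of the L1 norm of the feature's weight vector. The page does not define the column, so this reading is an assumption. The sketch below illustrates it with a hypothetical helper, `l1_share`, and a toy vector that is not the feature's real weights.

```python
def l1_share(weights):
    """Return each entry's absolute value as a percentage of the L1 norm."""
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [100.0 * abs(w) / l1_norm for w in weights]


# Toy vector (hypothetical values, not taken from the dashboard).
toy_weights = [0.5, -0.25, 0.25]
shares = l1_share(toy_weights)
```

Under this reading, a share below 1% for the top neuron would mean the feature's weight is spread thinly across many neurons rather than concentrated in a few.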
Correlated Neurons
Index   P. Corr.   Cos Sim.
468     +0.13      0.01
320     +0.13      0.01
510     +0.11      0.01
Negative Logits
ı     -2.94
·¸    -2.82
¿½    -2.77
Ķ     -2.74
»     -2.74
Ĺ     -2.57
¿     -2.55
Ļª    -2.46
ij    -2.46
ķ     -2.35
Positive Logits
enium      1.85
inates     1.70
unity      1.59
fors       1.55
erals      1.54
ató        1.51
ÃŃvel      1.45
bows       1.44
zilla      1.44
issance    1.43
Activations Density: 0.003%
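Activation density is commonly taken to be the percentage of tokens on which a feature's activation is above zero. The page does not say how its figure was computed, so that definition is an assumption. The sketch below uses it with a hypothetical helper, `activation_density`, and toy activations.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of positions whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy activations over four tokens (hypothetical, not dashboard data).
toy_activations = [0.0, 0.0, 1.2, 0.0]
density = activation_density(toy_activations)
```

By this definition, a density of 0.003% corresponds to roughly 3 active tokens per 100,000.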