INDEX
Explanations
phrases related to ethical behavior and societal issues
Neuron Alignment
Index    Value    % of L₁
330      +0.11    0.3%
581      +0.10    0.3%
2036     +0.08    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
330      +0.11       0.04
1257     +0.10       0.04
1592     +0.08       0.04
Negative Logits
severally    -0.55
Manufact     -0.54
pooh         -0.52
Righ         -0.52
accla        -0.50
mew          -0.50
snoopy       -0.50
philanth     -0.49
swee         -0.48
Ename        -0.48
Positive Logits
whatsoever    0.98
nor           0.67
except        0.60
Saluti        0.56
anymore       0.55
except        0.53
estekak       0.51
idać          0.50
المشاركات     0.49
<bos>         0.47
Activations Density: 0.326%