INDEX
Explanations
text related to ethical standards, morality, and breaches of ethics
Neuron Alignment
Index   Value   % of L₁
251     +0.14   0.5%
1133    +0.14   0.5%
1350    +0.13   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
251     +0.14      0.02
1872    +0.14      0.02
1133    +0.13      0.02
Negative Logits
sebastian   -0.62
halle       -0.56
vinyle      -0.55
suga        -0.55
parma       -0.55
claudia     -0.55
cupa        -0.54
luis        -0.54
Molière     -0.54
Mlle        -0.53
Positive Logits
ethics      1.44
Ethics      1.32
ethical     1.26
Ethics      1.26
ethics      1.16
Ethical     1.09
ethical     1.09
Ethical     1.06
ethically   1.04
ethic       0.98
Activations Density 0.061%