Explanations
expressions of acceptable social behavior or moral standing
Neuron Alignment
Index   Value   % of L₁
198     +0.24   1.5%
400     +0.23   1.4%
186     +0.20   1.2%
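A minimal sketch of how the Neuron Alignment columns could be derived. It assumes that Value is the feature's decoder weight on each MLP neuron and that % of L₁ is that weight's magnitude as a share of the decoder vector's L1 norm. Under that reading, 0.24 at 1.5% implies an L1 norm of roughly 16. The helper name `neuron_alignment` and the toy vector are hypothetical.

```python
def neuron_alignment(decoder_row, k=3):
    """Hypothetical helper: the k neurons with the largest |weight| in a
    feature's decoder vector, each with its share of the vector's L1 norm."""
    l1 = sum(abs(w) for w in decoder_row)
    # Rank neuron indices by weight magnitude, largest first.
    top = sorted(range(len(decoder_row)),
                 key=lambda i: abs(decoder_row[i]), reverse=True)[:k]
    return [(i, decoder_row[i], abs(decoder_row[i]) / l1) for i in top]


# Toy 4-neuron decoder vector (made-up numbers, not this feature's weights).
for index, value, share in neuron_alignment([0.1, -0.4, 0.3, 0.2]):
    print(f"{index:<6}{value:+.2f}  {share:.1%}")
```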
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
400     +0.24           0.10
198     +0.23           0.13
157     +0.20           0.14
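The Pearson-correlation and cosine-similarity columns compare the feature with individual neurons. The sketch below assumes both statistics are computed over the same per-token activations, pairing the feature's activation with the neuron's activation on each token. Pearson correlation centres each series before taking the normalised dot product, while cosine similarity does not. If that assumption holds, the difference would explain why the two columns can order the same neurons differently. The function names and toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: dot product of the mean-centred series divided by
    the product of their norms (assumes neither series is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cosine(x, y):
    """Cosine similarity: the same ratio without mean-centring."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Made-up activations over six tokens: a sparse feature and one neuron.
feature = [0.0, 0.0, 1.2, 0.0, 0.8, 0.0]
neuron = [-0.1, 0.3, 0.9, -0.2, 0.4, 0.1]
print(f"pearson={pearson(feature, neuron):+.2f}  cosine={cosine(feature, neuron):.2f}")
```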
Negative Logits
onica      -1.64
quart      -1.63
Archives   -1.56
pora       -1.52
Awards     -1.51
éric       -1.50
chaft      -1.42
Edited     -1.40
offices    -1.37
amycin     -1.36
Positive Logits
won        1.60
addy       1.50
ections    1.49
ickets     1.48
&&         1.48
![         1.43
anything   1.38
osition    1.38
olver      1.37
exists     1.36
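The negative and positive logit lists rank vocabulary tokens by how strongly the feature pushes their logits down or up. The sketch below assumes each score is the dot product of the feature's output direction in the residual stream with the matching column of the unembedding matrix W_U, ignoring the final layer norm. For an SAE trained on MLP neurons, that direction would be the decoder vector mapped through the MLP's output weights. The helper name `logit_effects` and the toy matrices are hypothetical.

```python
def logit_effects(direction, W_U, vocab, k=3):
    """Hypothetical helper: score every token by direction @ W_U and return the
    k lowest-scoring and k highest-scoring tokens.

    direction: residual-stream vector of length d_model (assumed to be the
               feature's output direction)
    W_U:       unembedding as a list of d_model rows, each of length len(vocab)
    """
    scores = [
        sum(direction[r] * W_U[r][t] for r in range(len(direction)))
        for t in range(len(vocab))
    ]
    order = sorted(range(len(vocab)), key=lambda t: scores[t])  # ascending
    negative = [(vocab[t], scores[t]) for t in order[:k]]
    positive = [(vocab[t], scores[t]) for t in reversed(order[-k:])]
    return negative, positive


# Toy 2-dimensional residual stream and 4-token vocabulary (made-up numbers).
vocab = ["tok0", "tok1", "tok2", "tok3"]
W_U = [[1.5, -1.0, 0.5, -0.2],
       [0.1, -0.6, 0.8, -1.4]]
neg, pos = logit_effects([1.0, 0.5], W_U, vocab, k=2)
print("negative:", neg)
print("positive:", pos)
```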
Activation Density: 3.161%
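A minimal sketch of the density figure. It assumes density is the share of sampled tokens on which the feature's activation is strictly positive. Under that reading, 3.161% means the feature fires on roughly 3 tokens in every 100. The function name `activation_density` and the zero threshold are assumptions.

```python
def activation_density(acts, threshold=0.0):
    """Fraction of tokens on which the feature's activation exceeds `threshold`."""
    return sum(1 for a in acts if a > threshold) / len(acts)


# Made-up activations over eight tokens; the feature fires on two of them.
acts = [0.0, 0.5, 0.0, 0.0, 1.2, 0.0, 0.0, 0.0]
print(f"Activation Density: {activation_density(acts):.3%}")
```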