INDEX
Explanations
terms related to altruism and cooperative behaviors
Neuron Alignment
Index   Value   % of L₁
376     +0.14   0.8%
189     +0.12   0.7%
415     +0.11   0.6%
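
The "% of L₁" column is not defined on this page. One plausible reading, assumed here, is that each value is divided by the L₁ norm (the sum of absolute values) of the whole weight vector. A minimal sketch under that assumption follows; the function name `l1_shares` and the toy vector are hypothetical and are not data from this dashboard.

```python
def l1_shares(weights, top_k=3):
    """Return (index, value, share_of_l1) for the top_k entries by absolute value.

    Assumption: "share" means |w_i| divided by the sum of |w_j| over the vector.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        return []  # an all-zero vector has no meaningful shares
    ranked = sorted(enumerate(weights), key=lambda iw: abs(iw[1]), reverse=True)
    return [(i, w, abs(w) / l1) for i, w in ranked[:top_k]]


# Hypothetical toy vector, chosen so that its L1 norm is exactly 1.0.
toy = [0.5, -0.25, 0.125, 0.125]
for idx, value, share in l1_shares(toy, top_k=2):
    print(f"{idx}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, small percentages such as those in the table would indicate that the weight is spread across many neurons rather than concentrated in a few.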
Correlated Neurons
Index   P. Corr.   Cos Sim.
189     +0.14      0.02
415     +0.12      0.02
302     +0.11      0.02
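
"P. Corr." and "Cos Sim." presumably stand for Pearson correlation and cosine similarity. The page does not say which vectors are being compared, so the sketch below only illustrates the two standard definitions on hypothetical inputs. Pearson correlation is cosine similarity applied after subtracting each vector's mean, which is why the two columns can differ for the same pair.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical example vectors, not taken from the dashboard.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(cosine_similarity(a, b), pearson_correlation(a, b))
```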
Negative Logits
forget     -1.93
wise       -1.77
minds      -1.70
assadors   -1.66
'?"        -1.64
obbsee     -1.63
ters       -1.48
notice     -1.46
orers      -1.46
)){        -1.45
Positive Logits
enstein    1.99
xton       1.98
cellar     1.47
ÅĽci       1.47
billing    1.42
ford       1.42
ppo        1.41
ende       1.40
pee        1.39
âĢŁ        1.38
Activations Density 0.018%