INDEX
Explanations
references to historical events or concepts
Neuron Alignment
Index   Value   % of L₁
156     +0.20   1.1%
100     +0.14   0.8%
28      +0.12   0.7%
Correlated Neurons
Index   P. Corr.   Cos Sim.
424     +0.20      0.01
360     +0.14      0.02
30      +0.12      0.02
Negative Logits
fficient    -1.78
thinner     -1.66
evenly      -1.61
treated     -1.56
skilled     -1.54
done        -1.47
wers        -1.46
akin        -1.41
tolerate    -1.41
vious       -1.41
Positive Logits
¡             1.75
grounds       1.74
ités          1.66
isation       1.66
ignment       1.59
oire          1.58
icity         1.55
affairs       1.51
atroc         1.46
istically     1.45
Activations Density: 0.201%