Explanations
references to men and boys in the text
Neuron Alignment
Index   Value   % of L₁
156     +0.20   1.1%
494     +0.18   1.0%
410     +0.11   0.6%
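The "% of L₁" column is not defined on this page. One plausible reading, stated here as an assumption, is that it reports each entry's absolute value as a share of the L₁ norm (sum of absolute values) of the full vector. Under that reading the three rows are mutually consistent and imply an L₁ norm of roughly 18 (0.20 / 0.011 ≈ 18.2, 0.18 / 0.010 = 18.0, 0.11 / 0.006 ≈ 18.3). The sketch below illustrates the calculation on a made-up five-entry vector; `l1_shares` and `toy_weights` are hypothetical names, not part of any dashboard API.

```python
def l1_shares(weights):
    """Return each entry's absolute value as a fraction of the vector's L1 norm.

    Assumption: this is what a "% of L1" column reports; the source page
    does not define the column.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [abs(w) / l1_norm for w in weights]


# Hypothetical toy vector (not data from the page); its L1 norm is 1.0.
toy_weights = [0.20, -0.05, 0.18, 0.11, -0.46]
shares = l1_shares(toy_weights)

for weight, share in zip(toy_weights, shares):
    print(f"{weight:+.2f}  {share:.1%}")
```

Because the shares are taken over absolute values, they always sum to 1 regardless of the signs of the individual entries.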
Correlated Neurons
Index   P. Corr.   Cos Sim.
494     +0.20      0.03
32      +0.18      0.03
81      +0.11      0.03
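Assuming "P. Corr." abbreviates Pearson correlation and "Cos Sim." cosine similarity (the page does not expand either label), the two columns measure related but different things: Pearson correlation is the cosine similarity of the two vectors after each has been mean-centered. The sketch below shows both on hypothetical inputs; which vectors the dashboard actually compares (activations, weights, or something else) is not stated in the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as cosine similarity of centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical vectors (not data from the page): perfectly anti-correlated,
# yet with a positive cosine similarity because all entries are positive.
u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]
print(pearson_correlation(u, v), cosine_similarity(u, v))
```

The toy vectors are chosen so that the two measures disagree in sign, which is why a table can reasonably list both columns side by side.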
Negative Logits
Token            Logit
↵                -3.01
↵                -3.01
(blank)          -3.01
↵                -3.01
<|outofrange|>   -3.01
↵                -3.01
<|outofrange|>   -3.01
↵                -3.01
<|outofrange|>   -3.01
<|outofrange|>   -3.01
Positive Logits
Token     Logit
opause    2.73
acing     2.60
opausal   2.45
aces      2.35
iscus     2.13
ace       2.10
ager      1.87
jor       1.80
oon       1.79
orrh      1.79
Activations Density: 0.130%