INDEX
Explanations
References to individuals, particularly occurrences of the word "person."
Neuron Alignment
Index   Value   % of L₁
365     +0.17   0.9%
456     +0.11   0.6%
407     +0.10   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
365     +0.17      0.03
407     +0.11      0.03
485     +0.10      0.02
Negative Logits
Token     Logit
Ĭ         -2.64
ĵ         -2.61
·¸        -2.53
·         -2.43
ı         -2.41
(blank)   -2.40
↵↵        -2.40
(blank)   -2.40
(blank)   -2.40
↵         -2.40
Positive Logits
Token     Logit
nel       2.41
who       1.88
ila       1.86
uscript   1.81
nal       1.75
iscus     1.73
owns      1.70
ager      1.65
acles     1.65
arman     1.65
Activations Density: 0.185%