INDEX
Explanations
instances of the word "official."
Neuron Alignment
Index   Value   % of L₁
156     +0.20   1.2%
362     +0.14   0.8%
87      +0.12   0.7%
Correlated Neurons
Index   P. Corr.   Cos Sim.
446     +0.20      0.03
208     +0.14      0.03
266     +0.12      0.02
Negative Logits
latter       -1.55
obacterium   -1.53
oyle         -1.45
rose         -1.44
angers       -1.39
abeth        -1.38
anson        -1.38
ellow        -1.34
gado         -1.34
yman         -1.33
Positive Logits
dom      1.98
doms     1.76
ships    1.73
pieces   1.67
bodies   1.56
blems    1.52
endar    1.50
hood     1.49
ities    1.46
esses    1.44
Activations Density 0.177%