INDEX
Explanations
sentences describing a character's experiences and thoughts
Neuron Alignment
Index    Value    % of L₁
50       +0.11    0.3%
1577     +0.10    0.3%
1533     +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1533     +0.11       0.01
184      +0.10       0.02
194      +0.10       0.02
Negative Logits
milano      -1.20
„,          -1.19
embra       -1.11
hcm         -1.09
increa      -1.09
voleva      -1.09
guarante    -1.09
dispen      -1.08
robus       -1.08
doman       -1.08
Positive Logits
knew      0.93
hadn      0.91
had       0.86
was       0.80
seemed    0.80
wasn      0.79
had       0.74
were      0.72
couldn    0.72
knew      0.71
Activations Density: 1.541%