INDEX
Explanations
mentions of specific character names
Neuron Alignment
Index   Value   % of L₁
1036    +0.10   0.3%
642     +0.09   0.3%
1984    +0.09   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
422     +0.10      0.04
1654    +0.09      0.05
1181    +0.09      0.05
Negative Logits
aen      -1.81
ftu      -1.70
mef      -1.69
fta      -1.69
fte      -1.66
„,       -1.64
sovere   -1.59
fatis    -1.59
wien     -1.56
thut     -1.55
Positive Logits
'        0.85
’        0.84
who      0.71
‘        0.70
during   0.66
before   0.65
via      0.62
about    0.61
,        0.61
him      0.60
Activations Density: 0.517%