INDEX
Explanations
mentions of names or identities in a text
Neuron Alignment
Index   Value   % of L₁
1150    +0.14   0.4%
1042    +0.11   0.3%
1309    +0.10   0.3%
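The page does not define the "% of L₁" column. A minimal sketch of one plausible reading, assuming each value is that neuron's share of the L₁ norm (sum of absolute values) of the feature's direction over all neurons; the function name `l1_share` and the toy vector are hypothetical:

```python
def l1_share(weights):
    """Return each component's share of the vector's L1 norm, in percent.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|. The source page
    does not state its definition, so treat this as illustrative.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [100.0 * abs(w) / l1_norm for w in weights]


# Hypothetical toy direction with three components (not data from the page).
toy_direction = [0.5, -0.25, 0.25]
print(l1_share(toy_direction))  # [50.0, 25.0, 25.0]
```

Under that assumption, a value of +0.14 at 0.4% would put the full L₁ norm somewhere around 30 to 40, given the rounding in the table.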
Correlated Neurons
Index   P. Corr.   Cos Sim.
1309    +0.14      0.04
270     +0.11      0.06
1114    +0.10      0.04
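The two columns above can differ because Pearson correlation centers each series on its mean before comparing, while cosine similarity does not. A minimal sketch, assuming both statistics are computed over paired activation values of the feature and a neuron (the page does not say how they are computed; the function names and sample values are hypothetical):

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)


# Hypothetical paired activations (not data from the page).
feature_acts = [0.0, 0.0, 1.0, 0.0, 2.0]
neuron_acts = [0.3, 0.1, 0.9, 0.2, 1.5]

print(pearson_correlation(feature_acts, neuron_acts))
print(cosine_similarity(feature_acts, neuron_acts))
```

Adding a constant offset to either series changes the cosine similarity but leaves the Pearson correlation unchanged, which is one reason the two columns need not agree.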
Negative Logits
Token    Value
„,       -0.96
inder    -0.96
?...     -0.93
effe     -0.92
»>       -0.82
§.       -0.81
desir    -0.81
uncin    -0.81
aen      -0.79
illi     -0.79
Positive Logits
Token         Value
nor           1.33
nor           0.94
anymore       0.90
whatsoever    0.82
neither       0.81
sondern       0.80
Nor           0.78
Nor           0.75
unless        0.74
except        0.74
Activations Density: 0.804%
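As a rough illustration of what a density figure like this measures, here is a minimal sketch that assumes density is the fraction of tokens on which the feature's activation exceeds a threshold. The page does not give its definition; `activation_density`, the `threshold` parameter, and the sample values are hypothetical:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of positions where the activation is above `threshold`.

    Assumption: "density" counts a token as active when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations over four tokens (not data from the page).
sample = [0.0, 0.0, 0.5, 0.0]
print(f"{activation_density(sample):.3%}")  # 25.000%
```

Read this way, a density of 0.804% would mean the feature fires on roughly 8 of every 1,000 tokens.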