INDEX
Explanations
phrases related to potential dangers, risks, and catastrophic events
Neuron Alignment
Index   Value   % of L₁
1438    +0.12   0.3%
1013    +0.10   0.3%
203     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
509     +0.12      0.07
284     +0.10      0.06
1438    +0.07      0.04
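The two columns in the Correlated Neurons table measure different things, and a small sketch may help when reading them. It assumes "P. Corr." stands for Pearson correlation and "Cos Sim." for cosine similarity; the page does not say which vectors are compared, so the inputs below are purely illustrative and the function names `pearson_corr` and `cosine_sim` are hypothetical helpers, not part of any dashboard API.

```python
import math

def pearson_corr(x, y):
    """Pearson correlation: covariance of the mean-centered series,
    divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the
    vector lengths, with no mean-centering."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative inputs only (assumed, not taken from the dashboard).
feature_acts = [0.0, 0.5, 1.0, 0.0]
neuron_acts = [0.1, 0.6, 1.1, 0.1]  # same shape, shifted by a constant
r = pearson_corr(feature_acts, neuron_acts)
c = cosine_sim(feature_acts, neuron_acts)
```

Because Pearson correlation centers each series first, a constant offset leaves it unchanged, whereas cosine similarity is sensitive to that offset. This is one reason the two columns can rank the same neurons differently.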
Negative Logits
Token   Logit
thut    -1.93
aen     -1.93
effe    -1.93
nece    -1.85
fte     -1.84
fta     -1.82
„,      -1.81
?...    -1.76
fep     -1.76
meis    -1.76
Positive Logits
Token     Logit
if        0.80
due       0.72
or        0.67
.         0.66
unless    0.66
if        0.65
because   0.63
<bos>     0.63
roasted   0.63
due       0.62
Activations Density 0.508%