INDEX
Explanations
phrases emphasizing the importance of certain actions or concepts
Neuron Alignment
Index   Value   % of L₁
1589    +0.09   0.2%
270     +0.09   0.2%
1372    +0.09   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
270     +0.09      0.04
1138    +0.09      0.02
1839    +0.09      0.04
Negative Logits
unlaw    -0.87
?...     -0.87
!...     -0.81
accla    -0.80
invin    -0.76
hentai   -0.76
„,       -0.76
pubg     -0.75
milf     -0.75
depic    -0.75
Positive Logits
Билгалдахарш   0.52
expandindo     0.50
kleber         0.49
StructEnd      0.48
browserify     0.47
aspect         0.46
glMatrixMode   0.46
之一           0.46
PyExc          0.45
ever           0.44
Activations Density 0.270%