Explanations
assertions or claims about truth and certainty
Neuron Alignment
Index    Value    % of L₁
50       +0.22    0.9%
1622     +0.09    0.4%
1793     +0.09    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1793     +0.22       0.08
100      +0.09       0.08
1601     +0.09       0.08
Negative Logits
<bos>        -2.47
<?           -0.85
             -0.83
<?           -0.82
/***         -0.81
ⓧ            -0.80
disbur       -0.73
defray       -0.70
assiste      -0.66
endow        -0.65
Positive Logits
pylab        0.83
číta         0.78
heapq        0.68
jectures     0.68
:]:          0.67
cristo       0.66
ados         0.65
irvana       0.65
functools    0.65
pymysql      0.64
Activations Density 0.655%