Explanations
adjectives expressing evaluation or suitability
Neuron Alignment
Index   Value   % of L₁
1013    +0.11   0.3%
2034    +0.10   0.3%
605     +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
2030    +0.11      0.06
1997    +0.10      0.06
468     +0.09      0.06
Negative Logits
emphat     -1.49
!...       -1.42
?...       -1.37
fuf        -1.34
increa     -1.33
indestru   -1.33
desir      -1.31
accla      -1.30
nece       -1.25
suspic     -1.24
Positive Logits
.        0.69
enough   0.68
;        0.67
?        0.62
for      0.61
,        0.60
:        0.59
。       0.58
in       0.58
!        0.58
Activations Density: 0.412%