INDEX
Explanations
phrases indicating contrasts or comparisons
Neuron Alignment
Index    Value    % of L₁
394      +0.11    0.3%
1385     +0.10    0.3%
16       +0.08    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
394      +0.11       0.04
3        +0.10       0.05
16       +0.08       0.05
Negative Logits
apprehen     -1.14
reluct       -1.12
disagre      -1.11
disgra       -1.11
gaily        -1.10
reconno      -1.02
vainly       -1.01
tolerably    -0.98
inconce      -0.97
accla        -0.97
Positive Logits
теризу            0.64
EXPERIMENTS       0.50
rozco             0.50
abuelos           0.50
marginVertical    0.50
YOND              0.49
tifact            0.49
PLATES            0.48
Mə                0.48
Voltaje           0.48
Activations Density: 0.310%