Explanations
phrases expressing clarity or definiteness
Neuron Alignment
Index   Value   % of L₁
1839    +0.12   0.4%
1435    +0.11   0.4%
938     +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1435    +0.12      0.05
1506    +0.11      0.04
1839    +0.11      0.04
Negative Logits
Token      Logit
apprehen   -1.34
excru      -1.33
?...       -1.27
impra      -1.27
reluct     -1.25
gaily      -1.25
unspeak    -1.24
indestru   -1.22
disagre    -1.22
!...       -1.21
Positive Logits
Token     Logit
clear     1.31
clear     1.29
Clear     1.18
Clear     1.15
CLEAR     1.03
CLEAR     1.03
clarity   0.96
clears    0.94
clearer   0.94
cleared   0.89
Activations Density: 0.095%