INDEX
Explanations
phrases related to correcting errors or addressing issues
Neuron Alignment
Index   Value   % of L₁
468     +0.13   0.4%
604     +0.12   0.3%
80      +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
468     +0.13      0.05
1322    +0.12      0.03
1360    +0.09      0.03
Negative Logits
!...      -0.87
unden     -0.86
<bos>     -0.85
fluo      -0.80
fers      -0.79
attemp    -0.79
wherea    -0.78
compen    -0.78
↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵    -0.77
honn      -0.76
Positive Logits
problems        0.80
problem         0.74
issues          0.69
shortcomings    0.66
deficiencies    0.64
problems        0.63
gaps            0.62
wrongs          0.60
imbalances      0.60
deficits        0.60
Activations Density 0.356%