INDEX
Explanations
phrases that emphasize contrasts or differences
Neuron Alignment
Index   Value   % of L₁
421     +0.12   0.4%
605     +0.12   0.4%
11      +0.11   0.4%
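The "% of L₁" column presumably reports each neuron's share of the L₁ norm of the feature's direction in neuron space; the page itself does not define it. A minimal sketch under that assumption follows. The function name `l1_share` and the toy vector are hypothetical and are not taken from this dashboard.

```python
def l1_share(weights, top_k=3):
    """Return (index, value, share of L1 norm) for the top_k largest weights.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|. This is an
    illustrative reading, not a definition confirmed by the source.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        raise ValueError("weights must not all be zero")
    # Rank neurons by signed value, largest first.
    ranked = sorted(enumerate(weights), key=lambda iw: iw[1], reverse=True)
    return [(i, w, abs(w) / l1) for i, w in ranked[:top_k]]


# Toy direction with made-up values (not from the dashboard).
toy = [0.5, -0.25, 0.125, 0.125]
rows = l1_share(toy, top_k=2)
for index, value, share in rows:
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, a value of +0.12 at 0.4% would imply an L₁ norm of roughly 30 for the full direction, which is consistent with weight spread thinly over many neurons.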
Correlated Neurons
Index   P. Corr.   Cos Sim.
11      +0.12      0.03
781     +0.12      0.03
421     +0.11      0.03
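"P. Corr." and "Cos Sim." presumably stand for Pearson correlation and cosine similarity; which vectors the dashboard compares is not stated here. The sketch below only illustrates how the two measures relate: Pearson correlation is cosine similarity computed after subtracting each vector's mean. The helper names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity(
        [a - mean_x for a in x],
        [b - mean_y for b in y],
    )


# Made-up vectors: y is x shifted by a constant offset.
x = [1.0, 2.0, 3.0, 4.0]
y = [11.0, 12.0, 13.0, 14.0]
pearson = pearson_correlation(x, y)
cosine = cosine_similarity(x, y)
```

Because a constant offset changes the angle between two vectors but not their centered shape, the two measures can disagree, so the columns are best read as complementary rather than redundant.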
Negative Logits
depic       -0.67
increa      -0.66
excru       -0.64
pourrais    -0.61
indestru    -0.58
Cuen        -0.58
prends      -0.56
inev        -0.56
guarante    -0.56
strick      -0.56
Positive Logits
different       0.73
Different       0.70
Different       0.68
Hentet          0.68
different       0.66
DIFFERENT       0.65
كومونز          0.64
differently     0.61
richTextPanel   0.60
MessageOf       0.57
Activations Density 0.087%