INDEX
Explanations
phrases indicating importance or emphasis
Neuron Alignment
Index    Value    % of L₁
1573     +0.13    0.5%
1265     +0.10    0.4%
1793     +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1573     +0.13       0.03
1793     +0.10       0.03
1370     +0.10       0.03
Negative Logits
préfère       -0.61
déclare       -0.57
Produzione    -0.57
Oster         -0.49
prouve        -0.49
considère     -0.49
défend        -0.47
Πηγές         -0.47
prends        -0.47
lega          -0.46
Positive Logits
note      1.02
noted     1.01
notes     1.01
noted     0.99
noting    0.97
note      0.89
notes     0.84
Note      0.84
Notes     0.83
NOTES     0.82
Activations Density: 0.050%