INDEX
Explanations
phrases indicating a recommended course of action or a comparison between different approaches or states
Neuron Alignment
Index   Value   % of L₁
2034    +0.10   0.3%
270     +0.10   0.3%
765     +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1104    +0.10      0.03
1450    +0.10      0.02
270     +0.09      0.03
Negative Logits
territo   -1.00
tew       -1.00
excu      -0.99
profi     -0.96
dises     -0.96
rafra     -0.91
abnorm    -0.91
hina      -0.91
maksi     -0.91
„,        -0.90
Positive Logits
собенности   0.66
фициальный   0.63
simply       0.62
lepiej       0.62
prostu       0.61
nowu         0.60
via          0.59
ypeł         0.56
фициаль      0.56
municipi     0.56
Activations Density 0.233%