INDEX
Explanations
phrases related to personal opinions or evaluations
Neuron Alignment
Index   Value   % of L₁
674     +0.09   0.3%
604     +0.08   0.2%
47      +0.08   0.2%
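The "% of L₁" column can be read as follows. This is a minimal sketch, assuming that "Value" is one component of the feature's weight vector in the neuron basis and that "% of L₁" is that component's absolute value divided by the L₁ norm of the whole vector. The function name `l1_share` and the toy weights are hypothetical and are not the feature's real weights.

```python
def l1_share(weights):
    """Rank vector components by magnitude and report each as a percentage of the L1 norm.

    Returns (index, value, percent_of_l1) tuples, largest |value| first.
    """
    total = sum(abs(w) for w in weights)  # L1 norm of the whole vector
    rows = [(i, w, 100.0 * abs(w) / total) for i, w in enumerate(weights)]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical toy vector; these are NOT the feature's real weights.
toy_weights = [0.05, -0.5, 0.25, 0.0, 0.2]
for index, value, pct in l1_share(toy_weights)[:3]:
    print(f"{index:<6}{value:+.2f}   {pct:.1f}%")
```

Under this reading, small percentages like 0.3% indicate that no single neuron carries much of the feature's weight.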
Correlated Neurons
Index   Pearson Corr.   Cos Sim.
689     +0.09           0.04
390     +0.08           0.04
169     +0.08           0.03
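The two correlation columns above can be illustrated with a small sketch. It assumes that both are computed over the same pair of per-token activation series (the feature's activation and one neuron's activation): the Pearson correlation centers each series first, while cosine similarity is the same ratio without centering. The function names and toy activations are hypothetical, and the dashboard's exact procedure may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance of the centered series over the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def cosine_similarity(x, y):
    """Cosine similarity: the same ratio computed without centering either series."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Hypothetical per-token activations for one feature and one neuron.
feature_acts = [0.0, 0.0, 3.0, 0.0, 1.0]
neuron_acts = [0.1, -0.2, 0.9, 0.0, 0.4]
print(f"Pearson Corr. {pearson(feature_acts, neuron_acts):+.2f}")
print(f"Cos Sim.      {cosine_similarity(feature_acts, neuron_acts):.2f}")
```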
Negative Logits
antem      -0.97
aton       -0.97
bandung    -0.96
lele       -0.94
Minang     -0.93
gend       -0.90
accla      -0.90
depic      -0.90
territo    -0.90
mef        -0.89
Positive Logits
includes   0.66
brings     0.59
thats      0.59
why        0.58
makes      0.58
reflects   0.57
happens    0.56
happened   0.55
allows     0.55
requires   0.54
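The logit lists above rank tokens by how strongly the feature pushes their logits up or down. The following is a minimal sketch, assuming that each token is scored by the dot product of the feature's output direction with that token's column of the unembedding matrix; the real dashboard may also apply normalization or rescaling, so the toy scores here are raw dot products. The vocabulary, matrix, and direction are all hypothetical.

```python
def logit_effects(direction, unembed, vocab):
    """Project a d_model-sized direction through an unembedding matrix.

    unembed is indexed [d_model][vocab], so score[j] = sum_i direction[i] * unembed[i][j].
    """
    scores = [
        sum(direction[i] * unembed[i][j] for i in range(len(direction)))
        for j in range(len(vocab))
    ]
    return list(zip(vocab, scores))


def top_and_bottom(effects, k):
    """Return the k most positive and the k most negative (token, score) pairs."""
    ranked = sorted(effects, key=lambda pair: pair[1])
    return ranked[::-1][:k], ranked[:k]


# Hypothetical 2-dimensional model with a 4-token vocabulary.
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
unembed = [
    [0.5, -0.3, 0.1, -0.8],
    [0.2, 0.4, -0.6, 0.0],
]
direction = [1.0, 0.5]
positive, negative = top_and_bottom(logit_effects(direction, unembed, vocab), k=2)
print("Positive:", positive)
print("Negative:", negative)
```

Because scores are computed over the tokenizer's vocabulary, entries such as "depic" or "territo" are subword tokens rather than misspellings.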
Activation Density 0.138%
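A density of 0.138% means the feature fires on roughly 138 of every 100,000 tokens, or about 1 token in 725. The sketch below assumes that density is the percentage of sampled tokens whose activation is strictly above zero; the threshold, the sample, and the function name are hypothetical, since the dataset used for the dashboard is not stated here.

```python
def activation_density(acts, threshold=0.0):
    """Percentage of tokens on which the feature's activation exceeds threshold."""
    fired = sum(1 for a in acts if a > threshold)
    return 100.0 * fired / len(acts)


# Hypothetical sample: 138 firing tokens out of 100,000.
sample = [0.0] * 99_862 + [1.5] * 138
print(f"Activation Density {activation_density(sample):.3f}%")
```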