INDEX
Explanations
phrases related to official statements, actions, or events
Neuron Alignment
Index    Value    % of L₁
1921     +0.15    0.5%
553      +0.12    0.4%
486      +0.11    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1921     +0.15       0.10
658      +0.12       0.08
553      +0.11       0.07
Negative Logits
unspeak            -0.76
maneu              -0.69
FFFF               -0.65
outlander          -0.64
impra              -0.63
ACKNOWLEDGMENTS    -0.63
snoopy             -0.63
disagre            -0.61
vincent            -0.61
indescri           -0.61
Positive Logits
been       0.87
BEEN       0.79
Been       0.79
kayo       0.76
intende    0.76
been       0.74
\%$\\      0.65
ISHOP      0.64
Muhamma    0.64
Faites     0.63
Activations Density: 0.207%