INDEX
Explanations
statements of criticism or commentary towards public figures or social issues
Neuron Alignment
Index    Value    % of L₁
453      +0.17    0.5%
1343     +0.14    0.4%
1978     +0.14    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
321      +0.17       0.03
1803     +0.14       0.03
1135     +0.14       0.02
Negative Logits
unspeak      -1.84
intersper    -1.83
increa       -1.83
snoopy       -1.80
fta          -1.79
thut         -1.78
ftu          -1.78
tolerably    -1.75
gaily        -1.74
apprehen     -1.72
Positive Logits
FlatAppearance      0.86
IntoConstraints     0.72
NOPQRST             0.71
DataPropertyName    0.70
Tē                  0.67
Producción          0.67
Opere               0.67
FlatStyle           0.67
dymyr               0.66
imageio             0.65
Activations Density 0.030%