INDEX
Explanations
phrases related to speaking up or speaking out on various issues or on behalf of others
Neuron Alignment
Index    Value    % of L₁
791      +0.12    0.3%
47       +0.10    0.3%
1604     +0.09    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
791      +0.12       0.05
47       +0.10       0.04
538      +0.09       0.04
Negative Logits
depic       -0.95
guarante    -0.90
?...        -0.86
accla       -0.86
encomp      -0.85
increa      -0.83
desir       -0.80
fta         -0.80
fuf         -0.80
»>          -0.79
Positive Logits
speak       0.74
loud        0.70
louder      0.70
spoken      0.69
voice       0.67
mouth       0.63
speaking    0.62
voices      0.61
spoken      0.61
aloud       0.60
Activations Density: 0.446%