INDEX
Explanations
phrases related to workplace safety and communication
Neuron Alignment
Index    Value    % of L₁
2034     +0.20    0.7%
674      +0.17    0.5%
478      +0.16    0.5%
Correlated Neurons
Index    P. Corr.    Cos Sim.
610      +0.20       0.06
1128     +0.17       0.05
478      +0.16       0.04
Negative Logits
shenan      -1.65
unspeak     -1.53
disagre     -1.49
maneu       -1.47
hairc       -1.45
unwarran    -1.45
horrend     -1.42
apprehen    -1.40
affor       -1.39
impra       -1.39
Positive Logits
<bos>     1.23
***!      0.82
↵↵        0.81
']."      0.78
↵↵↵       0.77
<eos>     0.77
__))      0.76
ferrer    0.75
}}        0.74
.”        0.74
Activations Density: 0.151%