INDEX
Explanations
phrases related to authority figures making statements or giving instructions
Neuron Alignment
Index   Value   % of L₁
1056    +0.10   0.3%
1839    +0.09   0.3%
674     +0.08   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1056    +0.10      0.05
780     +0.09      0.03
1055    +0.08      0.04
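The two columns above, P. Corr. (Pearson correlation) and Cos Sim. (cosine similarity), measure different things: Pearson correlation subtracts each vector's mean before comparing, cosine similarity does not. The page does not say which vectors are being compared, so the sketch below is only an illustration on made-up lists; the function names `pearson` and `cosine` are hypothetical and not taken from the dashboard.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Illustrative data only (assumption): y is x shifted by a constant offset.
x = [1.0, 2.0, 3.0]
y = [11.0, 12.0, 13.0]
r = pearson(x, y)   # the shift does not affect Pearson correlation
c = cosine(x, y)    # the shift does change the angle between the vectors
```

Because the two measures respond differently to a constant offset, the same neuron can rank differently under each, which is consistent with the table listing both.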
Negative Logits
cahier    -0.86
cannes    -0.85
peculi    -0.84
emphat    -0.81
Huhu      -0.80
agi       -0.80
fte       -0.80
sembl     -0.79
velours   -0.78
bourg     -0.77
Positive Logits
told      0.78
tell      0.75
how       0.71
about     0.70
tells     0.68
telling   0.63
tell      0.63
what      0.62
Told      0.61
<bos>     0.61
Activations Density: 0.140%
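The page does not define activation density. A common reading, assumed here, is the fraction of sampled tokens on which the feature's activation is above zero. The sketch below computes that fraction; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# Made-up sample (assumption): the feature fires on 7 of 5000 tokens.
sample = [0.0] * 4993 + [1.0] * 7
density = activation_density(sample)
label = f"{density:.3%}"
```

Under this reading, a density of 0.140% would mean the feature is active on roughly 14 of every 10,000 tokens in the sample.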