INDEX
Explanations
negations or expressions of disapproval
Neuron Alignment
Index    Value    % of L₁
50       +0.24    1.0%
1328     +0.13    0.5%
501      +0.12    0.5%
Correlated Neurons
Index    P. Corr.    Cos Sim.
501      +0.24       0.04
1328     +0.13       0.04
1352     +0.12       0.04
Negative Logits
<bos>         -1.92
intersper     -1.18
amass         -0.84
/***          -0.80
endow         -0.79
disarm        -0.77
rouse         -0.76
disambigu     -0.72
vanqu         -0.72
acquaint      -0.70
Positive Logits
should        0.85
Should        0.82
Should        0.81
should        0.80
SHOULD        0.75
hould         0.69
shouldn       0.68
bekah         0.60
noten         0.60
cautionary    0.60
Activations Density: 0.113%