INDEX
Explanations
terms and conditions or legal language in documents
Neuron Alignment
Index   Value   % of L₁
50      +0.28   1.1%
453     +0.14   0.5%
478     +0.06   0.2%
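The "% of L₁" column is not defined on this page. A minimal sketch of one plausible reading follows, assuming each percentage is the entry's absolute value divided by the L1 norm of the feature's direction expressed in the neuron basis. Under that assumption the three rows above imply an L1 norm of roughly 25 to 30 (for example 0.28 / 0.011 ≈ 25). The function name `l1_share` and the toy vector are hypothetical.

```python
def l1_share(direction, top_k=3):
    """Return (index, value, share of L1 norm) for the top_k largest-magnitude entries.

    Assumption: "% of L1" means abs(value) / sum(abs(all values)).
    """
    l1 = sum(abs(v) for v in direction)
    if l1 == 0:
        return []
    # Rank neuron indices by magnitude, largest first.
    ranked = sorted(range(len(direction)), key=lambda i: abs(direction[i]), reverse=True)
    return [(i, direction[i], abs(direction[i]) / l1) for i in ranked[:top_k]]


# Toy direction over four neurons (illustrative values, not from the dashboard).
toy = [0.5, -0.25, 0.0, 0.25]
rows = l1_share(toy, top_k=2)
for index, value, share in rows:
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```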
Correlated Neurons
Index   P. Corr.   Cos Sim.
453     +0.28      0.08
1832    +0.14      0.05
1419    +0.06      0.05
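The two columns above, Pearson correlation ("P. Corr.") and cosine similarity ("Cos Sim."), can diverge for the same pair of vectors, because Pearson correlation subtracts each vector's mean before comparing them. The page does not say which vectors are being compared, so the sketch below only illustrates the two standard definitions on hypothetical toy data.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


# Toy vectors (illustrative only): all-positive values, opposite trends.
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
cos_xy = cosine_similarity(x, y)        # positive, since every entry is positive
pearson_xy = pearson_correlation(x, y)  # negative, since the trends are opposite
```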
Negative Logits
Token     Logit
<bos>     -1.99
started   -0.69
got       -0.64
went      -0.64
came      -0.64
seemed    -0.63
,         -0.63
wanted    -0.62
began     -0.62
helped    -0.62
Positive Logits
Token       Logit
lele        1.56
vasi        1.54
stockholm   1.52
seksi       1.52
wien        1.52
cabrio      1.50
saar        1.49
maroc       1.48
socie       1.47
„,          1.46
Activations Density: 0.570%
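An activation density is commonly read as the fraction of sampled tokens on which the feature fires. The sketch below assumes that reading and assumes "fires" means an activation strictly greater than zero; neither is stated on this page. The function name `activation_density` and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy sample: 1000 tokens, 5 of which activate the feature.
sample = [0.0] * 995 + [1.2, 0.4, 3.1, 0.7, 0.05]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, the 0.570% shown above would correspond to about 57 active tokens per 10,000 sampled.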