INDEX
Explanations
conjunctions that introduce contrasting information
Neuron Alignment
Index   Value   % of L₁
50      +0.19   1.3%
555     +0.06   0.4%
1068    +0.06   0.4%
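The "% of L₁" column is not defined in this listing. A minimal sketch of one plausible reading, assuming it is the absolute value of a single weight divided by the L₁ norm (sum of absolute values) of the whole weight vector; the function name and the toy weights are hypothetical and not taken from the dashboard:

```python
def percent_of_l1(weights, index):
    """Return |weights[index]| as a percentage of the vector's L1 norm.

    Assumption: this mirrors what the "% of L1" column reports; the
    dashboard itself does not state the formula.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; the share is undefined")
    return 100.0 * abs(weights[index]) / l1_norm


# Hypothetical toy weights, not values from the dashboard.
toy_weights = [0.5, -0.25, 0.25]
share = percent_of_l1(toy_weights, 0)  # first weight's share of the L1 norm
```

Under this reading, a value of 0.19 at 1.3% and a value of 0.06 at 0.4% would both imply an L₁ norm of roughly 15, so the interpretation is at least consistent with the rounded figures shown.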
Correlated Neurons
Index   Pearson Corr.   Cos Sim.
1068    +0.19           0.14
1276    +0.06           0.13
555     +0.06           0.13
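The two columns can disagree because they measure different things. A minimal sketch of the standard definitions, assuming "P. Corr." means Pearson correlation; which vectors the dashboard actually compares is not stated here, so the inputs below are hypothetical:

```python
import math


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical toy vectors, not data from the dashboard.
u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]
r = pearson(u, v)  # centering exposes the opposite trends
c = cosine(u, v)   # stays positive because every entry is positive
```

Pearson correlation subtracts the mean before comparing, while cosine similarity does not, so the two numbers coincide only when both vectors already have zero mean.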
Negative Logits
Token                Logit
<bos>                -2.51
ⓧ                    -1.20
(no visible token)   -1.06
<?                   -0.99
/**                  -0.89
/*                   -0.88
<?                   -0.84
/***                 -0.81
/*++                 -0.80
public               -0.73
Positive Logits
Token       Logit
maneu       2.12
affor       1.94
impra       1.90
accla       1.88
increa      1.83
stockholm   1.79
shenan      1.76
scrat       1.75
reluct      1.74
disagre     1.74
Activations Density: 0.315%