INDEX
Explanations
phrases implying contrasting viewpoints or actions
Neuron Alignment
Index   Value   % of L₁
674     +0.08   0.2%
1728    +0.07   0.2%
1265    +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1265    +0.08      0.03
1424    +0.07      0.03
446     +0.07      0.03
Negative Logits
Juf       -0.97
aen       -0.94
NOO       -0.93
thut      -0.90
„,        -0.90
Hano      -0.90
ftu       -0.88
Febru     -0.88
ufe       -0.88
nomine    -0.88
Positive Logits
<bos>               0.58
AccessorTable       0.53
desire              0.52
bidden              0.49
ември               0.49
USTAIN              0.48
vlieg               0.48
ValueStyle          0.48
AppRoutingModule    0.47
Necesito            0.46
Activations Density: 0.146%