INDEX
Explanations
phrases related to discussion or debate
Neuron Alignment
Index   Value   % of L₁
761     +0.09   0.3%
1253    +0.09   0.2%
764     +0.08   0.2%
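
The "% of L₁" column is not defined on this page. One plausible reading, stated here as an assumption, is that it reports each neuron's absolute value as a share of the L₁ norm (sum of absolute values) of the whole vector. A minimal sketch of that reading follows; the function name `l1_shares` and the toy vector are hypothetical and are not taken from the dashboard.

```python
def l1_shares(values):
    """Return each entry's absolute value as a fraction of the vector's L1 norm.

    Assumption: "% of L1" means |v_i| / sum_j |v_j|. The dashboard itself
    does not state its definition.
    """
    total = sum(abs(v) for v in values)
    if total == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [abs(v) / total for v in values]


# Toy vector (made up for illustration, not dashboard data).
toy = [1.0, -3.0]
shares = l1_shares(toy)  # multiply by 100 to express as percentages
```

Under this reading, shares of only 0.2% to 0.3% for the top neurons would mean the vector is spread thinly over many neurons rather than concentrated on a few.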
Correlated Neurons
Index   P. Corr.   Cos Sim.
1519    +0.09      0.04
761     +0.09      0.05
1612    +0.08      0.02
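
The column headers "P. Corr." and "Cos Sim." most likely stand for Pearson correlation and cosine similarity, though the page does not say which vectors are being compared. As a hedged illustration of the two measures themselves (not of how this dashboard computes them), here is a small pure-Python sketch; the helper names `pearson_corr` and `cosine_sim` are hypothetical.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance of xs and ys divided by the product
    of their standard deviations (computed on mean-centered values)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return cov / denom


def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the
    Euclidean norms. Unlike Pearson correlation, no mean-centering."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The two measures differ only in the mean-centering step, so they can disagree when the vectors have nonzero means, which is one reason a dashboard might report both.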
Negative Logits
inappro   -0.84
deff      -0.77
fuf       -0.77
increa    -0.76
purcha    -0.76
iirc      -0.76
berea     -0.75
attemp    -0.74
Lmao      -0.74
Wtf       -0.73
Positive Logits
?}     1.26
?</    1.26
?      1.25
?      1.25
?”     1.24
?");   1.23
?’     1.23
؟      1.20
}?     1.18
?"     1.17
Activations Density 0.638%