INDEX
Explanations
phrases related to knowledge or stating facts
Neuron Alignment
Index    Value    % of L₁
1757     +0.12    0.4%
1527     +0.11    0.3%
404      +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
404      +0.12       0.03
2000     +0.11       0.03
1757     +0.10       0.03
Negative Logits
optik      -0.69
kapital    -0.63
kristal    -0.60
silikon    -0.60
adal       -0.60
etik       -0.58
keramik    -0.58
ekster     -0.56
kilomet    -0.55
alkoh      -0.55
Positive Logits
disreg       0.89
know         0.81
shenan       0.80
unspeak      0.80
know         0.78
KNOW         0.76
knows        0.72
quivering    0.71
impra        0.70
Know         0.69
Activations Density: 0.074%