INDEX
Explanations
quotation marks at the beginning or end of phrases
Neuron Alignment
Index    Value    % of L₁
1047     +0.18    0.9%
1392     +0.18    0.9%
492      +0.15    0.8%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1047     +0.18       0.03
492      +0.18       0.03
478      +0.15       0.01
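The two columns under Correlated Neurons can differ a lot for the same neuron. The sketch below shows how a Pearson correlation and a cosine similarity might be computed for a pair of vectors. It assumes "P. Corr." is the Pearson correlation and "Cos Sim." the cosine similarity; which vectors the dashboard actually compares is not stated here, and the arrays `feature_acts` and `neuron_acts` are made-up illustration data, not values from this page.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def cosine_sim(x, y):
    """Cosine similarity: normalized dot product, with no mean-centering."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical activations over five tokens (illustration only).
feature_acts = np.array([0.0, 0.0, 2.0, 0.0, 1.0])
neuron_acts = feature_acts + 1.0  # same shape, shifted by a constant offset

p_corr = pearson_corr(feature_acts, neuron_acts)
cos_sim = cosine_sim(feature_acts, neuron_acts)
print(f"P. Corr. = {p_corr:+.2f}, Cos Sim. = {cos_sim:.2f}")
```

Because Pearson correlation subtracts the mean first, a constant offset leaves it unchanged, while cosine similarity is sensitive to it. That is one reason the two numbers need not track each other.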
Negative Logits
<bos>         -1.72
/*!           -0.77
sog           -0.69
/***          -0.61
Stä           -0.59
Referencoj    -0.59
Ufer          -0.57
dras          -0.57
aggres        -0.57
nark          -0.55
Positive Logits
nmax          0.97
Joaqu         0.78
unspeak       0.76
😭😭          0.73
ingrat        0.69
ajustable     0.68
impra         0.68
Mérida        0.68
ados          0.68
Cadiz         0.67
Activations Density 0.341%