INDEX
Explanations
phrases related to information accuracy and ethical considerations
Neuron Alignment
Index   Value   % of L₁
674     +0.12   0.3%
1690    +0.09   0.3%
906     +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1690    +0.12      0.05
470     +0.09      0.04
1477    +0.08      0.05
Negative Logits
malheure    -0.99
unwarran    -0.96
Wtf         -0.94
shenan      -0.88
Ikr         -0.86
Yess        -0.84
Noice       -0.84
disagre     -0.84
Lmao        -0.83
effray      -0.83
Positive Logits
<bos>            1.05
interesting      0.73
story            0.64
interesting      0.63
sworth           0.61
Himo             0.58
interes          0.58
fascinating      0.57
juicy            0.56
ContentValues    0.56
Activations Density 0.555%