INDEX
Explanations
terms related to additional or supplementary aspects
Neuron Alignment
Index   Value   % of L₁
757     +0.14   0.5%
131     +0.13   0.4%
1512    +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
757     +0.14      0.03
1512    +0.13      0.03
11      +0.11      0.03
Negative Logits
Token      Value
Viene      -0.68
unspeak    -0.67
Lmao       -0.64
affor      -0.62
Fuckin     -0.60
indescri   -0.60
Wtf        -0.57
imprimer   -0.56
Adorable   -0.56
Chapitre   -0.55
Positive Logits
Token      Value
Extra      1.08
extra      1.08
EXTRA      1.08
extra      1.05
Extra      1.03
EXTRA      0.99
ekstra     0.90
extras     0.89
extras     0.83
xtra       0.77
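Lists like the two above are often produced by projecting a feature's output direction through the model's unembedding matrix and ranking the vocabulary by the resulting values. The page does not say how its logits were computed, so the following is only a sketch under that assumption. The direction, matrix, and vocabulary are toy values invented for illustration, and both function names are hypothetical.

```python
def logit_effects(direction, unembed):
    """Project a direction (length d) through an unembedding matrix (d x vocab)."""
    vocab_size = len(unembed[0])
    return [
        sum(direction[i] * unembed[i][t] for i in range(len(direction)))
        for t in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k):
    """Return the k most positive and k most negative (token, value) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1], reverse=True)
    most_positive = ranked[:k]
    most_negative = ranked[-k:][::-1]  # most negative first
    return most_positive, most_negative


# Toy inputs, not taken from any real model.
direction = [1.0, 0.5]
unembed = [
    [1.0, -1.0, 0.0],
    [0.0, 0.5, -2.0],
]
vocab = ["extra", "lmao", "the"]

effects = logit_effects(direction, unembed)
positive, negative = top_and_bottom(effects, vocab, k=1)
```

Under this reading, a positive value means the feature pushes the model toward predicting that token, and a negative value means it pushes away from it.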
Activations Density 0.063%
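Activation density is commonly understood as the fraction of sampled tokens on which a feature is active. The sketch below assumes that definition and treats any activation above zero as active. The page states neither the threshold nor the sample it used, so the function name, the threshold, and the toy activations are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above the threshold (assumed definition)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy activations: the feature fires on 1 of 4 tokens.
sample = [0.0, 0.0, 0.5, 0.0]
density = activation_density(sample)
```

A density of 0.063% would correspond, under this definition, to the feature firing on roughly 6 of every 10,000 sampled tokens.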