INDEX
Explanations
phrases expressing personal beliefs and values
Neuron Alignment
Index   Value   % of L₁
50      +0.19   0.7%
394     +0.19   0.6%
1177    +0.13   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
599     +0.19      0.15
1415    +0.19      0.06
468     +0.13      0.10
Negative Logits
susun            -0.73
sumpay           -0.65
marrone          -0.63
beginnetje       -0.62
дописавши        -0.62
Personensuche    -0.58
tanong           -0.58
silang           -0.58
trovo            -0.57
AddTagHelper     -0.57
Positive Logits
”,           0.59
",           0.56
prerog       0.52
")           0.52
").          0.52
despotism    0.51
"            0.50
”)           0.50
"),          0.50
”            0.49
Activations Density: 3.694%