INDEX
Explanations
phrases related to decision-making and self-improvement
Neuron Alignment
Index   Value   % of L₁
604     +0.14   0.4%
1531    +0.09   0.3%
1698    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1531    +0.14      0.04
1580    +0.09      0.03
2026    +0.08      0.03
Negative Logits
alkoh     -1.04
uhr       -1.00
kompati   -0.99
maksi     -0.97
keramik   -0.96
lele      -0.93
Kategor   -0.93
<bos>     -0.93
lemp      -0.93
antik     -0.91
Positive Logits
unnecessary     0.69
harmful         0.62
unhealthy       0.62
wastes          0.61
unhelpful       0.59
wasted          0.58
pointless       0.58
ineffective     0.57
detrimental     0.57
unnecessarily   0.56
Activations Density 0.387%