Explanations
mentions of unchecked situations or actions that may lead to escalation or negative consequences
Neuron Alignment
Index   Value   % of L₁
1363    +0.12   0.4%
1557    +0.10   0.3%
478     +0.09   0.3%
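The "% of L₁" column is not defined on this page. One plausible reading, stated here as an assumption, is that it gives each neuron's absolute weight as a share of the L₁ norm (the sum of absolute values) of the whole weight vector. A minimal sketch of that calculation follows; the function name `l1_shares` and the toy vector are hypothetical and not taken from the dashboard.

```python
def l1_shares(weights):
    """Return each weight's absolute value as a fraction of the vector's L1 norm.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|. The page does not confirm this.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        # An all-zero vector has no meaningful shares.
        return [0.0 for _ in weights]
    return [abs(w) / l1 for w in weights]


# Hypothetical toy vector, not data from the dashboard.
toy = [0.5, -0.25, 0.25]
shares = l1_shares(toy)
```

Under this reading, small percentages for the top neurons would suggest that the weight is spread across many neurons rather than concentrated in a few.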
Correlated Neurons
Index   P. Corr.   Cos Sim.
1363    +0.12      0.05
2042    +0.10      0.04
1557    +0.09      0.04
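The two columns above report related but distinct measures: Pearson correlation is the cosine similarity of the two vectors after each has been mean-centered, so the values can differ whenever the vectors have nonzero means. A minimal sketch of both, using only the standard library, is below. The helper names and the toy vectors are hypothetical, and how the dashboard actually obtains its vectors is not stated on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(u, v):
    """Pearson correlation: cosine similarity after mean-centering.

    Assumes neither vector is constant (otherwise the centered norm is zero).
    """
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical toy vectors: y is an exact affine function of x (y = 2x + 1).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
r = pearson(x, y)   # perfectly correlated after centering
c = cosine(x, y)    # slightly below 1 because of the offset
```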
Negative Logits
Token          Logit
compromising   -0.62
solicited      -0.61
intelligible   -0.58
sightly        -0.55
comfor         -0.52
djang          -0.51
Să             -0.50
ilever         -0.49
itinéraire     -0.49
lwjgl          -0.49
Positive Logits
Token      Logit
paff       0.95
territo    0.87
tramont    0.86
vns        0.85
meis       0.82
vnt        0.81
fuo        0.80
monaster   0.80
chèvre     0.80
fua        0.80
Activations Density: 0.200%