INDEX
Explanations
Sentences about the consequences of actions or decisions, with a focus on potentially severe outcomes.
Neuron Alignment
Index    Value    % of L₁
604      +0.10    0.3%
509      +0.09    0.2%
468      +0.08    0.2%

Correlated Neurons
Index    P. Corr.    Cos Sim.
716      +0.10       0.03
100      +0.09       0.04
1424     +0.08       0.04

Negative Logits
antik      -1.00
alkoh      -0.97
fers       -0.92
plak       -0.89
meis       -0.89
elek       -0.88
silikon    -0.87
kram       -0.86
ché        -0.85
lele       -0.84

Positive Logits
fatalities      0.79
death           0.76
deaths          0.76
fatality        0.74
harm            0.69
irreversible    0.65
tragedy         0.65
bloodshed       0.61
destruction     0.60
tragic          0.59

Activations Density: 0.618%