INDEX
Explanations
words related to negative events, disasters, and challenges
Neuron Alignment
Index   Value   % of L₁
1705    +0.12   0.3%
674     +0.09   0.3%
973     +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1705    +0.12      0.06
1852    +0.09      0.04
1746    +0.08      0.04
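The P. Corr. and Cos Sim. columns presumably report a Pearson correlation and a cosine similarity. The page does not say which vectors are compared, so the sketch below only illustrates the two standard formulas on made-up inputs. The function names `pearson` and `cosine` and the sample values are hypothetical, not taken from the dashboard or its tooling.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances are left unnormalised; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up example values, for illustration only.
feature_acts = [0.0, 0.5, 1.0, 0.0]
neuron_acts = [0.1, 0.4, 0.9, 0.2]
print(pearson(feature_acts, neuron_acts))
print(cosine(feature_acts, neuron_acts))
```

Pearson correlation subtracts the mean from each sequence before comparing them and cosine similarity does not, so the two columns can differ even when computed from the same pair of vectors.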
Negative Logits
sonder       -0.60
hever        -0.55
pfe          -0.54
elee         -0.54
wille        -0.52
individuel   -0.52
tothe        -0.50
heyd         -0.50
Neub         -0.50
esss         -0.49
Positive Logits
ⓧ            0.66
tetrach      0.65
popoli       0.65
kasama       0.64
ffilm        0.63
affatto      0.62
venuto       0.62
<bos>        0.61
kanya        0.61
papà         0.59
Activations Density 0.390%