INDEX
Explanations
descriptions of heroic acts and deeds, often in the context of challenging or dangerous situations
Neuron Alignment
Index   Value   % of L₁
478     +0.10   0.3%
764     +0.09   0.3%
1473    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1691    +0.10      0.04
478     +0.09      0.03
868     +0.08      0.03
Negative Logits
jati      -0.91
idr       -0.90
disagre   -0.89
nece      -0.87
haer      -0.85
meis      -0.85
inev      -0.84
nuoc      -0.84
gend      -0.84
karte     -0.84
Positive Logits
heroism      0.82
bravery      0.80
heroic       0.78
courage      0.78
selfless     0.76
hero         0.74
courageous   0.71
heroes       0.71
dedication   0.68
admirable    0.65
Activations Density: 0.797%