INDEX
Explanations
phrases related to intentions or actions of individuals
Neuron Alignment
Index   Value   % of L₁
184     +0.20   0.6%
1842    +0.19   0.6%
674     +0.18   0.6%
Correlated Neurons
Index   Pearson Corr.   Cos. Sim.
184     +0.20           0.03
605     +0.19           0.01
1419    +0.18           0.04
Negative Logits
depic      -1.15
encomp     -1.12
inev       -1.10
disagre    -1.09
apprehen   -1.05
„,         -1.05
increa     -1.04
Juf        -1.04
hcm        -1.04
alre       -1.03
Positive Logits
himself    1.31
his        1.12
himself    1.08
Himself    0.90
his        0.90
he         0.75
His        0.74
His        0.72
seiner     0.72
seinem     0.70
Activations Density 0.543%