INDEX
Explanations
personal pronouns referring to individuals or groups, as well as actions performed by those individuals or groups
Neuron Alignment
Index   Value   % of L₁
1482    +0.10   0.3%
1861    +0.10   0.3%
47      +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1861    +0.10      0.04
851     +0.10      0.03
47      +0.10      0.03
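The two columns above can be illustrated with a minimal sketch. It assumes that "P. Corr." is the Pearson correlation between the feature's activations and a neuron's activations over the same tokens, and that "Cos Sim." is the cosine similarity between the feature's direction and the neuron's weight vector. The dashboard does not state either definition, and the function names and toy numbers below are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length activation series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def cosine(u, v):
    """Cosine similarity of two equal-length direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy data (hypothetical, not taken from the dashboard).
feature_acts = [0.0, 0.0, 1.2, 0.0, 0.9]   # feature activation per token
neuron_acts = [0.1, -0.3, 0.8, 0.2, 0.1]   # neuron activation per token
feature_dir = [0.5, -0.1, 0.2]             # feature direction
neuron_dir = [0.1, 0.4, -0.3]              # neuron weight vector

print(round(pearson(feature_acts, neuron_acts), 2))
print(round(cosine(feature_dir, neuron_dir), 2))
```

Under these assumptions the two numbers measure different things: the correlation compares behavior over tokens, while the cosine compares directions in weight space, so they need not agree.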
Negative Logits
increa     -1.60
reluct     -1.60
fuf        -1.56
apprehen   -1.53
disagre    -1.51
depic      -1.48
impra      -1.48
emphat     -1.46
accla      -1.43
suscep     -1.41
Positive Logits
selves       0.87
herself      0.87
himself      0.87
self         0.85
themselves   0.85
yourself     0.85
Himself      0.82
ourselves    0.81
SELF         0.79
myself       0.77
Activations Density 0.110%
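
As a rough illustration of the density figure, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature's activation exceeds a threshold (zero here). The dashboard does not give a definition, so the function name, threshold, and toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above `threshold`."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# Toy data (hypothetical): the feature fires on 1 token out of 1000.
toy_activations = [0.0] * 999 + [2.5]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.100%
```

On this reading, a density of 0.110% would mean the feature is active on roughly 11 of every 10,000 tokens.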