Explanations
Personal pronouns referring to a specific individual (New Auto-Interp)
Neuron Alignment
Index   Value   % of L₁
1978    +0.17   0.5%
381     +0.16   0.5%
1919    +0.15   0.5%
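The sketch below shows one plausible way to compute a neuron-alignment table like the one above. It assumes that "Value" is the feature's raw decoder weight on each neuron and that "% of L₁" is that weight's absolute value divided by the L₁ norm of the whole decoder vector. The source does not confirm either reading, and the helper name `neuron_alignment` is hypothetical.

```python
def neuron_alignment(decoder_vec, top_k=3):
    """Rank neurons by a feature's decoder weight (hypothetical helper).

    Assumption: "Value" is the raw decoder weight for a neuron and
    "% of L1" is |weight| divided by the L1 norm of the full decoder vector.
    """
    l1_norm = sum(abs(w) for w in decoder_vec)
    # Neuron indices ordered from the largest decoder weight to the smallest
    order = sorted(range(len(decoder_vec)), key=lambda i: decoder_vec[i], reverse=True)
    return [(i, decoder_vec[i], abs(decoder_vec[i]) / l1_norm) for i in order[:top_k]]


# Toy 4-neuron decoder vector (made-up values, not taken from the dashboard)
for index, value, share in neuron_alignment([1.0, -2.0, 5.0, 2.0], top_k=2):
    print(f"{index}  {value:+.2f}  {share:.1%}")
```

Under this reading, shares of only 0.5% for the top neurons would mean that no single neuron accounts for much of the feature's decoder weight.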
Correlated Neurons
Index   Pearson corr.   Cosine sim.
1919    +0.17           0.09
1510    +0.16           0.06
1637    +0.15           0.05
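The two statistics in this table can be illustrated with a small sketch. The code assumes that both columns compare the feature's activations with each neuron's activations over the same set of tokens. The source does not say which vectors the cosine similarity is taken over, so that pairing is an assumption. The helper `correlated_neurons` and its data layout are hypothetical. The sketch also assumes that no activation vector is constant or all-zero, since either case would make a denominator zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: centered covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cosine_similarity(u, v):
    """Uncentered counterpart: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def correlated_neurons(feature_acts, neuron_acts, top_k=3):
    """Rank neurons by Pearson correlation with the feature (hypothetical helper).

    neuron_acts[i] holds neuron i's activation on each token, aligned with feature_acts.
    """
    scores = [(i, pearson(feature_acts, acts), cosine_similarity(feature_acts, acts))
              for i, acts in enumerate(neuron_acts)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```

Pearson correlation subtracts each vector's mean before comparing the two, but cosine similarity does not. For sparse, mostly-zero activations the two statistics can therefore differ noticeably, as they do in the rows above.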
Negative Logits
Token      Logit
olx        -1.24
levis      -1.22
budapest   -1.19
fatis      -1.18
magis      -1.16
umo        -1.16
Juf        -1.14
tanong     -1.14
wien       -1.13
lele       -1.12
Positive Logits
Token      Logit
He         0.87
He         0.86
he         0.85
he         0.77
didn       0.75
himself    0.74
did        0.72
She        0.71
hasn       0.70
She        0.69
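One common way to produce positive and negative logit lists is to push the feature's decoder direction through the model's output weights and unembedding, then read off the largest and smallest entries. The sketch below assumes that direct-path reading, ignores layer norm and any later layers, and uses a hypothetical matrix layout. The helper names `logit_effects` and `top_and_bottom` are illustrative and do not come from the dashboard.

```python
def logit_effects(decoder_vec, w_out, w_unembed):
    """Direct effect of a feature on every vocabulary logit (hypothetical direct path).

    decoder_vec: n_neurons; w_out: n_neurons x d_model; w_unembed: d_model x n_vocab.
    Layer norm and any later layers are ignored in this sketch.
    """
    d_model = len(w_out[0])
    # Map the decoder vector from neuron space into the residual stream
    resid = [sum(decoder_vec[i] * w_out[i][j] for i in range(len(decoder_vec)))
             for j in range(d_model)]
    # Project the residual-stream direction onto each token's unembedding column
    return [sum(resid[j] * w_unembed[j][t] for j in range(d_model))
            for t in range(len(w_unembed[0]))]


def top_and_bottom(vocab, effects, k):
    """Return the k most positive and the k most negative (token, logit) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    return ranked[::-1][:k], ranked[:k]
```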
Activation density: 0.282%
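Activation density is usually read as the fraction of tokens on which a feature is active. The minimal sketch below assumes that "active" means an activation strictly greater than zero. The source does not state the threshold, and the helper name `activation_density` is hypothetical.

```python
def activation_density(feature_acts):
    """Fraction of tokens on which the feature is active (activation > 0, assumed threshold)."""
    return sum(1 for a in feature_acts if a > 0) / len(feature_acts)


# Made-up activations over 8 tokens: the feature fires on 2 of them
print(f"{activation_density([0.0, 0.0, 1.3, 0.0, 0.0, 0.4, 0.0, 0.0]):.3%}")
```

Under this reading, the reported 0.282% corresponds to the feature firing on roughly 1 token in 355.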