INDEX
Explanations
proper nouns, particularly names of authors or book titles
Neuron Alignment
Index    Value    % of L₁
1150     +0.25    0.8%
1343     +0.17    0.5%
227      +0.13    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
981      +0.25       0.09
227      +0.17       0.08
1097     +0.13       0.07
Negative Logits
كومونز             -0.75
Poznám             -0.69
pueden             -0.65
ricev              -0.64
enzuela            -0.62
ModelExpression    -0.60
pinak              -0.59
después            -0.59
algunos            -0.57
himo               -0.57
Positive Logits
subgoals    0.56
inappro     0.51
extré       0.51
sokak       0.51
célé        0.49
desnuda     0.49
Jr          0.49
Hitam       0.49
(@          0.48
vecteur     0.48
Activations Density 0.308%