INDEX
Explanations
phrases expressing requests or demands
Neuron Alignment
Index   Value   % of L₁
161     +0.14   0.4%
405     +0.11   0.3%
1993    +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
161     +0.14      0.05
405     +0.11      0.03
303     +0.10      0.03
Negative Logits
Lma        -0.92
FTFY       -0.91
invin      -0.90
guarante   -0.90
encomp     -0.89
YMMV       -0.87
alre       -0.87
Lmao       -0.86
scrat      -0.86
affor      -0.85
Positive Logits
that            0.60
that            0.56
dass            0.55
Aholisi         0.52
UnusedPrivate   0.52
THAT            0.51
Roskov          0.51
ویکیپدی         0.50
rằng            0.50
everyone        0.49
Activations Density: 0.186%