INDEX
Explanations
phrases indicating uncertainty or reservation in a statement
Neuron Alignment
Index    Value    % of L₁
381      +0.11    0.3%
1036     +0.10    0.3%
484      +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
603      +0.11       0.04
1590     +0.10       0.03
1269     +0.10       0.04
Negative Logits
fta       -1.30
secon     -1.29
oner      -1.29
aen       -1.28
emphat    -1.26
„,        -1.26
seiz      -1.26
?...      -1.21
hcm       -1.21
perfon    -1.21
Positive Logits
digress   0.76
also      0.67
still     0.64
também    0.62
aren      0.59
isn       0.59
wasn      0.59
don       0.58
worse     0.57
didn      0.57
Activations Density: 0.397%