INDEX
Explanations
phrases indicating logic, rationality, or coherence
Neuron Alignment
Index   Value   % of L₁
67      +0.12   0.4%
1757    +0.10   0.3%
75      +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
683     +0.12      0.03
1372    +0.10      0.02
75      +0.10      0.02
Negative Logits
?...       -1.05
emphat     -1.05
!...       -0.93
immen      -0.92
increa     -0.92
accla      -0.90
alre       -0.90
tantôt     -0.90
guarante   -0.89
Augu       -0.89
Positive Logits
sense      0.98
sense      0.84
Sense      0.78
ensical    0.78
SENSE      0.72
Sense      0.69
sentido    0.65
sens       0.61
fony       0.59
logic      0.57
Activations Density: 0.064%