INDEX
Explanations
questions involving comparison and moral judgment
Neuron Alignment
Index    Value    % of L₁
1438     +0.10    0.3%
1150     +0.10    0.3%
1013     +0.08    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1411     +0.10       0.03
1438     +0.10       0.03
832      +0.08       0.04
Negative Logits
aussitôt       -0.69
%\[            -0.66
proprement     -0.66
volon          -0.63
quelles        -0.61
quoique        -0.61
librement      -0.59
útil           -0.58
destinées      -0.57
quelquefois    -0.57
Positive Logits
.           0.77
!           0.71
indescri    0.65
;           0.64
disreg      0.64
unspeak     0.63
suscep      0.62
.\\         0.62
.;          0.61
。          0.60
Activations Density: 0.240%