INDEX
Explanations
phrases expressing criticism or negative judgment
Neuron Alignment
Index   Value   % of L₁
605     +0.12   0.4%
1758    +0.11   0.4%
1262    +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1262    +0.12      0.04
1974    +0.11      0.04
1758    +0.11      0.05
Negative Logits
Voi              -0.57
Võ               -0.55
fince            -0.54
{$\              -0.53
AppCompatTheme   -0.50
Hæ               -0.49
ecera            -0.49
whofe            -0.49
traverser        -0.49
zoll             -0.48
Positive Logits
necessarily      0.84
necessarily      0.61
not              0.59
NOT              0.50
ļ                0.50
consultato       0.49
not              0.48
merely           0.48
unlike           0.48
necesariamente   0.48
Activations Density 0.099%