Explanations
negations or phrases that indicate something is not true or not present
Neuron Alignment
Index   Value   % of L₁
250     +0.13   0.7%
70      +0.11   0.6%
506     +0.11   0.6%
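
One plausible reading of the "% of L₁" column is that each neuron's value is expressed as a share of the L₁ norm (the sum of absolute values) of the full weight vector. The sketch below illustrates that arithmetic under this assumption. The function names `l1_shares` and `top_aligned` and the toy vector are hypothetical and are not taken from the dashboard or from this feature's actual weights.

```python
def l1_shares(weights):
    """Return each weight's absolute value as a percentage of the vector's L1 norm."""
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [100.0 * abs(w) / l1 for w in weights]


def top_aligned(weights, k=3):
    """Return (index, value, percent of L1) for the k largest-magnitude weights."""
    shares = l1_shares(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i], shares[i]) for i in order[:k]]


# Toy vector for illustration only (assumption: not real data from this feature).
toy = [0.5, -0.25, 0.125, 0.125]
for idx, value, pct in top_aligned(toy, k=2):
    print(f"{idx:<7} {value:+.2f}   {pct:.1f}%")
```

If this reading is right, small percentages like those in the table would indicate that the weight mass is spread across many neurons rather than concentrated in a few.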
Correlated Neurons
Index   P. Corr.   Cos Sim.
250     +0.13      0.08
70      +0.11      0.07
506     +0.11      0.06
Negative Logits
ķ     -2.77
ļ     -2.74
±     -2.74
ĻĤ    -2.71
ĨĴ    -2.71
»     -2.71
Ĵ     -2.71
Ļª    -2.68
®     -2.67
ĸ´    -2.66
Positive Logits
oriously      2.56
anymore       2.43
yet           2.24
necessarily   1.90
necessarily   1.90
enough        1.86
yet           1.81
allowed       1.81
tingham       1.73
etheless      1.73
Activations Density 0.328%