INDEX
Explanations
phrases related to performing actions or asking questions about processes
Neuron Alignment
Index   Value   % of L₁
284     +0.12   0.7%
395     +0.11   0.6%
320     +0.10   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
284     +0.12      0.04
33      +0.11      0.04
395     +0.10      0.04
Negative Logits
tolerance    -1.74
!\           -1.61
comments     -1.58
notice       -1.54
fine         -1.45
warnings     -1.45
dys          -1.37
writ         -1.36
apologies    -1.35
prejudice    -1.35
Positive Logits
↵                   2.79
↵                   2.79
(token not shown)   2.79
↵ âĢĥ               2.79
č↵                  2.79
↵                   2.79
(token not shown)   2.79
(token not shown)   2.79
↵↵                  2.79
<|outofrange|>      2.79
Activations Density 0.319%