INDEX
Explanations
phrases indicating intent or desire, particularly involving "to" followed by verbs
Neuron Alignment
Index   Value   % of L₁
263     +0.19   1.1%
478     +0.15   0.9%
320     +0.15   0.8%
Correlated Neurons
Index   P. Corr.   Cos Sim.
263     +0.19      0.03
292     +0.15      0.03
24      +0.15      0.03
Negative Logits
nikov      -1.69
anese      -1.61
steps      -1.58
steps      -1.54
Statutes   -1.51
teenth     -1.50
Steps      -1.49
ories      -1.49
âĢIJ        -1.49
coats      -1.49
Positive Logits
treat       1.77
resume      1.74
restore     1.67
receive     1.65
guarantee   1.64
recreate    1.56
safely      1.54
ren         1.54
capture     1.52
remove      1.52
Activations Density 0.090%