INDEX
Explanations
phrases related to instructing actions or emphasizing consequences
Neuron Alignment
Index   Value   % of L₁
1415    +0.10   0.3%
1967    +0.10   0.3%
2019    +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1921    +0.10      0.05
1415    +0.10      0.03
1310    +0.09      0.03
Negative Logits
hairc     -1.45
matel     -1.38
milf      -1.38
!...      -1.37
perfet    -1.36
Cfr       -1.34
milano    -1.34
Juf       -1.32
?...      -1.32
exé       -1.32
Positive Logits
realize      0.67
enjoy        0.67
look         0.66
become       0.65
introduce    0.64
try          0.64
make         0.64
tell         0.63
say          0.62
إذا          0.62
Activations Density: 0.239%