INDEX
Explanations
phrases indicating certainty or emphasis
Neuron Alignment
Index   Value   % of L₁
971     +0.13   0.4%
1065    +0.13   0.4%
605     +0.12   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1065    +0.13      0.03
971     +0.13      0.03
446     +0.12      0.02
Negative Logits
🤣🤣      -0.71
😍😍      -0.71
🥲        -0.70
unil      -0.69
🥲        -0.69
calvin    -0.68
Simult    -0.68
🙃        -0.68
😭😭      -0.67
☺☺        -0.66
Positive Logits
definitely    0.75
Definitely    0.74
<bos>         0.71
definitely    0.69
Definitely    0.68
definite      0.62
expandindo    0.61
definitiv     0.52
definite      0.50
definately    0.49
Activations Density: 0.097%