Explanations
expressions of gratitude
Neuron Alignment
Index   Value   % of L₁
1839    +0.13   0.4%
1557    +0.11   0.3%
1056    +0.11   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1557    +0.13      0.02
683     +0.11      0.02
1056    +0.11      0.02
Negative Logits
            -0.62
zoll        -0.57
alfo        -0.54
habang      -0.54
confider    -0.52
bensin      -0.51
ftill       -0.51
isoli       -0.50
горит       -0.50
caufe       -0.49
Positive Logits
thank       1.01
thank       0.98
Thank       0.97
Thank       0.95
THANK       0.93
THANK       0.85
thanking    0.74
thanks      0.74
Thanks      0.70
thanks      0.70
Activations Density 0.029%