Explanations
comparisons using the word "like" to describe similarities between different things
Neuron Alignment
Index   Value   % of L₁
1053    +0.13   0.4%
1491    +0.11   0.3%
1101    +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1491    +0.13      0.04
554     +0.11      0.05
1101    +0.10      0.04
Negative Logits
unspeak       -0.56
indescri      -0.53
cuck          -0.51
beaut         -0.51
beaute        -0.51
kaç           -0.50
cushi         -0.50
Medea         -0.49
dateOfBirth   -0.47
czegóły       -0.47
Positive Logits
like      0.80
LIKE      0.75
LIKE      0.73
nagu      0.73
affez     0.71
like      0.70
trover    0.70
mosso     0.67
jät       0.67
preghi    0.66
Activations Density: 0.111%