INDEX
Explanations
quotes starting with "I", especially reflections or explanations about decisions and experiences
Neuron Alignment
Index   Value   % of L₁
674     +0.17   0.6%
341     +0.11   0.4%
1124    +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
341     +0.17      0.03
1124    +0.11      0.03
1085    +0.11      0.02
Negative Logits
acherous   -0.56
lemp       -0.55
colas      -0.51
kela       -0.50
thuy       -0.50
laci       -0.49
ginald     -0.48
örgy       -0.48
doğan      -0.47
bambu      -0.47
Positive Logits
Weiter       0.58
bonté        0.57
polski       0.56
pères        0.56
Tja          0.55
calciatore   0.54
PON          0.54
citoyen      0.52
Obsah        0.52
Prí          0.52
Activations Density: 0.042%