INDEX
Explanations
phrases that indicate change or transformation
Neuron Alignment
Index   Value   % of L₁
358     +0.14   0.8%
470     +0.13   0.7%
376     +0.12   0.7%
Correlated Neurons
Index   P. Corr.   Cos Sim.
239     +0.14      0.04
62      +0.13      0.03
196     +0.12      0.03
Negative Logits
ĻĤ      -3.10
¿       -2.76
Ļ       -2.58
į       -2.57
ĥ       -2.55
ľĵ      -2.45
Īĺ      -2.38
©       -2.35
ľ       -2.34
ĺ       -2.31
Positive Logits
tons        1.70
routines    1.65
matters     1.52
regimes     1.48
wording     1.47
behaviour   1.46
habits      1.41
gears       1.40
pone        1.40
verted      1.40
Activations Density 0.159%