INDEX
Explanations
phrases expressing societal expectations and personal responsibility
Neuron Alignment
Index   Value   % of L₁
391     +0.16   0.9%
198     +0.16   0.9%
436     +0.13   0.8%
Correlated Neurons
Index   P. Corr.   Cos Sim.
391     +0.16      0.10
198     +0.16      0.09
326     +0.13      0.07
Negative Logits
Token      Logit
egan       -1.69
hereby     -1.58
blogger    -1.46
owner      -1.43
daughter   -1.40
subscrib   -1.40
refund     -1.36
eca        -1.35
blog       -1.35
jointly    -1.34
Positive Logits
Token        Logit
bol          1.67
orts         1.62
knowledge    1.58
unity        1.50
sek          1.47
strength     1.45
lat          1.41
acet         1.40
resolution   1.39
bod          1.38
Activations Density 2.761%