INDEX
Explanations
words related to surprises, shocks, or unexpected events
Neuron Alignment
Index    Value    % of L₁
889      +0.21    0.8%
1047     +0.13    0.5%
680      +0.12    0.5%
Correlated Neurons
Index    P. Corr.    Cos Sim.
889      +0.21       0.03
1056     +0.13       0.03
1047     +0.12       0.03
Negative Logits
OGND             -0.55
الحره            -0.51
djangoproject    -0.48
➕                -0.47
Pautan           -0.47
resta            -0.45
apist            -0.45
Brett            -0.45
Teb              -0.44
Mull             -0.44
Positive Logits
shock      1.39
Shock      1.25
SHOCK      1.20
shock      1.19
shocks     1.17
Shock      1.12
shocked    1.06
shocked    1.02
fuper      0.92
whofe      0.91
Activations Density 0.060%