Explanations
New Auto-Interp: mentions of crimes and justice-related scenarios
Neuron Alignment
Index   Value   % of L₁
2034    +0.11   0.3%
1842    +0.09   0.3%
1042    +0.09   0.2%
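A minimal sketch of how an alignment table like this could be produced. It assumes that "Value" is the feature's decoder weight on a given neuron and that "% of L₁" is that weight's absolute value divided by the L₁ norm of the whole decoder vector. Both readings are assumptions, and `neuron_alignment` and the toy vector are hypothetical.

```python
# Hypothetical sketch: rank neurons by a feature's decoder weights and report
# each weight's share of the vector's L1 norm.
# ASSUMPTION: "% of L1" = |weight| / sum(|all weights|), not the dashboard's code.

def neuron_alignment(decoder_vector, top_k=3):
    """Return (neuron_index, weight, pct_of_l1) for the top_k largest weights."""
    l1 = sum(abs(w) for w in decoder_vector)
    # Sort neuron indices by signed weight, largest first.
    ranked = sorted(range(len(decoder_vector)),
                    key=lambda i: decoder_vector[i], reverse=True)
    return [(i, decoder_vector[i], 100.0 * abs(decoder_vector[i]) / l1)
            for i in ranked[:top_k]]


# Toy 5-neuron decoder vector (made-up numbers, not the feature above).
vec = [0.05, -0.20, 0.40, 0.10, 0.25]
for idx, weight, pct in neuron_alignment(vec):
    print(f"{idx:>3}  {weight:+.2f}  {pct:.1f}%")
```

Under this reading, small percentages like the ones above indicate that the feature's direction is spread across many neurons rather than concentrated on any single one.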
Correlated Neurons
Index   Pearson corr.   Cosine sim.
62      +0.11           0.04
490     +0.09           0.03
1882    +0.09           0.05
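The correlation and cosine-similarity columns can be illustrated with two small helpers. This sketch assumes the correlation is Pearson's r between the feature's activations and a neuron's activations over the same tokens, and that the cosine similarity compares two weight vectors. Exactly which vectors the dashboard compares is not stated here, so the inputs below are hypothetical toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical per-token activations for one feature and one neuron.
feature_acts = [0.0, 1.2, 0.0, 3.1, 0.4]
neuron_acts = [0.1, 0.9, -0.2, 2.5, 0.3]
print(round(pearson(feature_acts, neuron_acts), 3))
```

The two measures answer different questions. Correlation is about co-activation on data, while cosine similarity is about geometric overlap of weights, so a neuron can score low on one and higher on the other.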
Negative Logits
dises   -1.16
grati   -1.10
sappi   -1.09
magis   -1.06
sopr    -1.04
igno    -1.04
sii     -1.00
dissi   -0.99
Chá     -0.99
illi    -0.98
Positive Logits
then       0.77
whether    0.74
chances    0.73
unless     0.70
if         0.69
whether    0.68
you        0.68
suddenly   0.66
then       0.65
Then       0.65
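The logit lists rank tokens by how strongly the feature pushes their output logits down or up. A minimal sketch follows. It assumes each token's score is the dot product of the feature's output direction with that token's column of an unembedding matrix, ignoring layer normalization and any intermediate layers. The function name, matrix, and vocabulary are hypothetical toy values.

```python
# Hypothetical sketch: score every vocabulary token by projecting a feature's
# output direction onto the unembedding, then return the extremes.
# ASSUMPTION: logit effect = dot(direction, unembed[:, token]); layer norm ignored.

def top_bottom_logits(direction, unembed, vocab, k=3):
    """Return (most negative, most positive) lists of (score, token) pairs.

    unembed is a list of d_model rows, each of length len(vocab).
    """
    scores = []
    for t, token in enumerate(vocab):
        column = [row[t] for row in unembed]
        scores.append((sum(d * c for d, c in zip(direction, column)), token))
    scores.sort()  # ascending by score
    return scores[:k], scores[::-1][:k]


# Toy 2-dimensional model with a 4-token vocabulary (made-up numbers).
direction = [1.0, -0.5]
unembed = [[0.2, -1.0, 0.7, 0.0],
           [0.4, 0.3, -0.6, 0.9]]
vocab = ["a", "b", "c", "d"]
negative, positive = top_bottom_logits(direction, unembed, vocab, k=2)
print("negative:", negative)
print("positive:", positive)
```

In a real model this ranking would run over the full vocabulary. Tokens that appear twice (such as a variant with and without a leading space) would then show up as separate rows.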
Activation density: 0.408%
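A minimal sketch of the activation density figure. It assumes density means the fraction of tokens in a dataset on which the feature's activation exceeds a threshold (here zero). The `threshold` parameter and the toy data are hypothetical.

```python
# Hypothetical sketch: activation density as the share of tokens where the
# feature fires. ASSUMPTION: "fires" means activation > threshold (default 0).

def activation_density(activations, threshold=0.0):
    """Fraction of tokens on which the feature's activation exceeds threshold."""
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# Toy data: 1,000 tokens, 3 of which activate the feature.
acts = [0.0] * 997 + [0.7, 1.3, 0.2]
print(f"{100 * activation_density(acts):.3f}%")  # 0.300%
```

Under this reading, a density of 0.408% means the feature is active on roughly 4 of every 1,000 tokens.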