INDEX
Explanations
terms relating to legal or policy-related language
Neuron Alignment
Index   Value   % of L₁
12      +0.18   1.1%
442     +0.14   0.8%
198     +0.13   0.8%
Correlated Neurons
Index   P. Corr.   Cos Sim.
12      +0.18      0.14
78      +0.14      0.05
455     +0.13      0.12
Negative Logits
½       -3.43
ľĵ      -3.27
ĺ       -3.25
§       -3.21
©       -3.20
ľ       -3.06
¡       -3.00
ı       -2.97
ĨĴ      -2.96
ĻĤ      -2.84
Positive Logits
mattered   1.77
eries      1.60
ctrine     1.54
icable     1.53
issue      1.49
ricular    1.47
ifiable    1.43
asic       1.42
omy        1.39
etric      1.38
Activations Density 4.269%