INDEX
Explanations
phrases regarding permissible actions within a legal context
Neuron Alignment
Index   Value   % of L₁
186     +0.13   0.8%
95      +0.13   0.8%
323     +0.12   0.7%
Correlated Neurons
Index   P. Corr.   Cos Sim.
323     +0.13      0.01
145     +0.13      0.01
95      +0.12      0.01
Negative Logits
ľ       -1.63
ĨĴ      -1.61
®       -1.59
ľĵ      -1.57
![**    -1.51
Ĺ       -1.50
İ       -1.48
'</     -1.43
Ļª      -1.43
arer    -1.42
Positive Logits
alone       1.54
fulness     1.44
fields      1.41
feit        1.40
ng          1.36
country     1.32
debt        1.29
ourselves   1.29
domains     1.28
field       1.28
Activations Density 0.023%