INDEX
Explanations
text related to policy discussion with a focus on regulations and exceptions
Neuron Alignment
Index    Value    % of L₁
1131     +0.07    0.2%
307      +0.07    0.2%
2027     +0.07    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1580     +0.07       0.04
1847     +0.07       0.04
1383     +0.07       0.04
Negative Logits
depic        -0.63
fameux       -0.63
shenan       -0.62
Nicolai      -0.61
reluct       -0.60
intersper    -0.60
Bartholo     -0.60
McLaugh      -0.59
milf         -0.59
apprehen     -0.58
Positive Logits
rather     1.82
rather     1.68
instead    1.49
instead    1.41
Rather     1.39
Rather     1.39
Instead    1.27
而不是     1.19
Instead    1.18
plutôt     1.13
Activations Density 0.686%