INDEX
Explanations
phrases related to justification or reasoning
Neuron Alignment
Index   Value   % of L₁
1950    +0.14   0.5%
144     +0.14   0.5%
605     +0.13   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1950    +0.14      0.03
699     +0.14      0.03
899     +0.13      0.03
Negative Logits
inev        -0.81
unlaw       -0.73
viciss      -0.73
lara        -0.71
accla       -0.71
campa       -0.70
berea       -0.70
guarante    -0.70
alre        -0.69
opel        -0.68
Positive Logits
justify         1.13
justified       1.08
justify         1.07
justifies       1.03
justification   1.03
justific        0.89
Jus             0.84
Jus             0.82
justified       0.82
justifying      0.81
Activations Density 0.088%