INDEX
Explanations
phrases related to blame and responsibility
Neuron Alignment
Index   Value   % of L₁
674     +0.10   0.3%
468     +0.08   0.2%
62      +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1919    +0.10      0.04
744     +0.08      0.03
847     +0.08      0.02
Negative Logits
uninten   -0.82
dilap     -0.82
seclu     -0.80
impra     -0.79
reluct    -0.76
resear    -0.76
unve      -0.75
Kün       -0.72
depic     -0.72
saar      -0.72
Positive Logits
ciless           0.65
SharedDtor       0.50
mbad             0.50
pexpr            0.50
blame            0.49
AssertionError   0.49
eload            0.48
ündigt           0.48
utriche          0.48
somehow          0.47
Activations Density: 0.329%