INDEX
Explanations
sentences related to ethical values, social justice, and political commentary
Neuron Alignment
Index   Value   % of L₁
80      +0.09   0.2%
872     +0.09   0.2%
1135    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
382     +0.09      0.05
163     +0.09      0.02
1135    +0.08      0.03
Negative Logits
suspic   -1.50
excru    -1.47
reluct   -1.43
embra    -1.41
Perci    -1.40
inev     -1.38
accla    -1.37
impra    -1.37
compen   -1.36
increa   -1.34
Positive Logits
ones        0.86
which       0.69
ones        0.66
Ones        0.63
whose       0.63
which       0.62
including   0.62
where       0.61
sahiptir    0.60
olyan       0.59
Activations Density: 0.314%