INDEX
Explanations
phrases related to violent attacks and responsibility claims
Neuron Alignment
Index   Value   % of L₁
284     +0.10   0.3%
1150    +0.10   0.3%
1784    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
284     +0.10      0.09
332     +0.10      0.05
247     +0.08      0.05
Negative Logits
?...     -1.83
emphat   -1.80
desir    -1.79
!...     -1.79
effe     -1.75
accla    -1.72
increa   -1.72
affor    -1.71
unden    -1.70
suscep   -1.69
Positive Logits
.          0.83
while      0.76
after      0.75
;          0.74
but        0.74
。         0.73
although   0.73
when       0.73
for        0.71
because    0.68
Activations Density: 0.483%