Explanations
words related to disbelief, disgust, and disrespect
Neuron Alignment
Index    Value    % of L₁
866      +0.14    0.5%
1535     +0.10    0.3%
501      +0.10    0.3%
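The "% of L₁" column is not defined on this page. One plausible reading, offered here only as an assumption, is that it reports each neuron's absolute value as a share of the L₁ norm (the sum of absolute values) of the whole vector. A minimal Python sketch of that reading follows; the function name `l1_share` and the toy vector are hypothetical and not taken from the dashboard.

```python
def l1_share(values):
    """Return each entry's absolute value as a fraction of the vector's L1 norm.

    Assumption: this mirrors what a "% of L1" column might report; the
    dashboard itself does not state its formula.
    """
    l1 = sum(abs(v) for v in values)
    if l1 == 0:
        # An all-zero vector has no meaningful shares.
        return [0.0 for _ in values]
    return [abs(v) / l1 for v in values]


# Hypothetical toy vector, not dashboard data.
weights = [0.5, -0.25, 0.25]
shares = l1_share(weights)
for w, s in zip(weights, shares):
    print(f"{w:+.2f}  {s:.1%}")
```

Under this reading, small percentages such as those in the table would indicate that the vector's mass is spread over many neurons rather than concentrated in a few.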
Correlated Neurons
Index    P. Corr.    Cos Sim.
866      +0.14       0.03
765      +0.10       0.03
501      +0.10       0.03
Negative Logits
on                   -0.71
to                   -0.66
Rome                 -0.63
in                   -0.63
for                  -0.62
[token not shown]    -0.61
of                   -0.60
has                  -0.59
or                   -0.59
,                    -0.59
Positive Logits
dises       1.75
fordable    1.47
fatis       1.37
hdi         1.37
fta         1.37
isuzu       1.33
ftu         1.32
dci         1.32
imbal       1.32
milano      1.31
Activations Density 0.045%
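
Activation density is commonly understood as the fraction of sampled tokens on which a feature fires at all. The sketch below illustrates that definition in Python, assuming "fires" means an activation strictly above a threshold of zero; the function name `activation_density`, the threshold, and the toy data are hypothetical, and the dashboard's exact counting rule is not stated on this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above `threshold`.

    Assumption: a token counts as active when its activation exceeds zero;
    the dashboard may use a different rule.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations, not dashboard data.
toy_activations = [0.0, 0.0, 1.2, 0.0]
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

By this definition, a density of 0.045% would correspond to the feature being active on roughly 45 out of every 100,000 tokens.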