INDEX
Explanations
words related to news events and possible controversial political statements
Neuron Alignment
Index   Value   % of L₁
528     +0.10   0.3%
667     +0.10   0.3%
411     +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
667     +0.10      0.03
1140    +0.10      0.03
353     +0.09      0.03
Negative Logits
Token       Logit
</h2>       -0.51
</h3>       -0.49
</strong>   -0.48
OW          -0.47
↵↵          -0.47
}{          -0.47
.           -0.47
-           -0.46
0           -0.46
↵           -0.46
Positive Logits
Token   Logit
thut    1.15
Souha   1.13
fta     1.12
Juf     1.11
aen     1.10
Khart   1.08
fortn   1.07
dises   1.07
Adieu   1.07
squa    1.05
Activations Density 0.115%