Explanations
mentions of physical violence or aggressive actions
Neuron Alignment
Index   Value   % of L₁
1938    +0.07   0.2%
394     +0.07   0.2%
1398    +0.06   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1891    +0.07      0.02
236     +0.07      0.04
1812    +0.06      0.03
Negative Logits
coq         -0.82
stockholm   -0.82
purcha      -0.80
budapest    -0.78
increa      -0.78
wien        -0.76
lola        -0.75
sii         -0.74
fortn       -0.74
alre        -0.73
Positive Logits
<bos>       0.72
forehead    0.52
twice       0.52
face        0.51
directo     0.47
somewhere   0.46
***!        0.45
ويد         0.44
head        0.44
shoulder    0.43
Activations Density 0.159%