INDEX
Explanations
the presence of specific words related to violent events and actions
Neuron Alignment
Index   Value   % of L₁
1385    +0.13   0.4%
946     +0.12   0.3%
906     +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
946     +0.13      0.06
857     +0.12      0.03
1352    +0.10      0.05
Negative Logits
soto        -0.63
plis        -0.63
ekos        -0.60
muna        -0.59
encre       -0.59
stopp       -0.57
cabrio      -0.57
habang      -0.56
Italijani   -0.56
pecuni      -0.56
Positive Logits
écout       0.76
fameux      0.65
découv      0.61
évit        0.61
curieux     0.60
parlant     0.59
conçus      0.59
offrant     0.58
réal        0.58
rassemb     0.58
Activations Density 0.337%