INDEX
Explanations
negative and aggressive language, including death threats and hate-filled messages
Neuron Alignment
Index    Value    % of L₁
776      +0.13    0.4%
50       +0.12    0.4%
678      +0.11    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
16       +0.13       0.07
382      +0.12       0.05
683      +0.11       0.06
Negative Logits
shenan          -1.03
hairc           -1.02
juges           -1.01
ecru            -1.01
négociations    -0.98
<bos>           -0.95
plais           -0.94
récompenses     -0.93
réunions        -0.93
vœux            -0.93
Positive Logits
unexpected      0.61
discussions     0.61
occasional      0.59
eclamp          0.58
Palembang       0.57
wareness        0.57
intenance       0.56
caña            0.56
heridos         0.56
prayers         0.55
Activations Density 0.566%