INDEX
Explanations
details of violent incidents
Neuron Alignment
Index    Value    % of L₁
50       +0.14    0.5%
1842     +0.13    0.5%
766      +0.12    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
184      +0.14       0.01
137      +0.13       0.05
1842     +0.12       0.03
Negative Logits
GEBURTSDATUM       -0.64
himo               -0.62
awtextra           -0.60
Personendaten      -0.60
intios             -0.59
ConstraintMaker    -0.57
Geplaatst          -0.57
انيف               -0.56
AddTagHelper       -0.55
存于互联网档案馆      -0.55
Positive Logits
reluct        1.05
impra         0.97
philanth      0.95
disagre       0.92
impractica    0.92
Confu         0.92
wherea        0.92
unlaw         0.91
ineffec       0.91
inappro       0.90
Activations Density 0.521%