Explanations
references to the concept of "malice" or negative traits

Neuron Alignment
Index    Value    % of L₁
795      +0.17    1.1%
50       +0.15    0.9%
971      +0.12    0.7%

Correlated Neurons
Index    P. Corr.    Cos Sim.
795      +0.17       0.03
1622     +0.15       0.02
1331     +0.12       0.02

Negative Logits
<bos>             -2.43
Географи          -0.70
setDo             -0.70
AutoScaleMode     -0.68
Петербург         -0.65
AppCompatTheme    -0.64
UseVisualStyle    -0.63
Hướng             -0.63
mergeFrom         -0.62
<tfoot>           -0.62

Positive Logits
thut         1.72
ftu          1.62
perfon       1.60
effe         1.60
reft         1.59
wien         1.59
sovere       1.55
stockholm    1.53
Augu         1.52
§.           1.52

Activations Density: 0.065%