INDEX
Explanations
negative emotions or criticisms related to behaviors or policies
Neuron Alignment
Index   Value   % of L₁
1013    +0.11   0.3%
605     +0.10   0.3%
468     +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
2030    +0.11      0.05
468     +0.10      0.05
1997    +0.10      0.05
Negative Logits
Solución      -0.72
chiaramente   -0.70
Πηγές         -0.70
rendono       -0.69
DropColumn    -0.68
scoper        -0.68
apparti       -0.67
Explicación   -0.66
Nuorodos      -0.64
Voci          -0.63
Positive Logits
rval        0.78
nmax        0.76
licious     0.76
maxSize     0.69
ly          0.69
newArr      0.69
withal      0.69
imageName   0.68
posX        0.68
startX      0.67
Activations Density 0.249%