Explanations
statements or phrases highlighting the consequences or ethical implications of actions
Neuron Alignment
Index  Value  % of L₁
1376   +0.13  0.4%
674    +0.12  0.4%
596    +0.11  0.3%
Correlated Neurons
Index  P. Corr.  Cos Sim.
1047   +0.13     0.04
1376   +0.12     0.04
596    +0.11     0.04
Negative Logits
maroc   -0.69
thuy    -0.67
gmbh    -0.66
myn     -0.66
ria     -0.65
meis    -0.65
inder   -0.65
wien    -0.63
sena    -0.62
ambass  -0.62
Positive Logits
so       0.86
so       0.67
So       0.66
paž      0.62
So       0.62
spesies  0.59
SO       0.58
zodat    0.57
Så       0.56
bzw      0.53
Activations Density 0.099%