INDEX
Explanations
terms related to security measures and potential vulnerabilities
Neuron Alignment
Index   Value   % of L₁
1908    +0.08   0.2%
1511    +0.08   0.2%
391     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
330     +0.08      0.03
208     +0.08      0.04
1908    +0.07      0.03
Negative Logits
apprehen   -1.78
disagre    -1.58
reluct     -1.54
accla      -1.47
Juf        -1.45
unspeak    -1.44
shenan     -1.42
increa     -1.41
encomp     -1.41
reconno    -1.39
Positive Logits
lujo               0.65
afford             0.62
anymore            0.62
AnimationsModule   0.56
luxury             0.55
LIABLE             0.54
HideFlags          0.54
ménages            0.52
lose               0.51
darf               0.51
Activations Density: 0.191%