INDEX
Explanations
adjectives related to weakness or vulnerability
Neuron Alignment
Index    Value    % of L₁
347      +0.17    0.6%
1350     +0.14    0.5%
1677     +0.13    0.4%
Correlated Neurons
Index    P. Corr.    Cos Sim.
347      +0.17       0.03
1677     +0.14       0.02
1350     +0.13       0.02
Negative Logits
finn     -0.59
gild     -0.58
zyn      -0.58
inder    -0.58
oner     -0.58
lts      -0.58
?...     -0.57
Gies     -0.57
embra    -0.57
mme      -0.56
Positive Logits
weak         1.25
weak         1.19
Weak         1.19
Weak         1.10
weakest      1.03
weaker       1.02
weaken       1.01
weakness     0.98
weakened     0.94
weakening    0.88
Activations Density: 0.065%