INDEX
Explanations
API-related calls and error messages
Neuron Alignment
Index    Value    % of L₁
419      +0.20    1.1%
135      +0.18    1.0%
47       +0.15    0.9%
Correlated Neurons
Index    Pearson Corr.    Cos Sim.
135      +0.20            0.16
271      +0.18            -0.04
480      +0.15            0.09
Negative Logits
safety      -1.58
career      -1.47
oken        -1.36
Safety      -1.36
fulness     -1.36
iquit       -1.29
iels        -1.28
jeopardy    -1.28
immunity    -1.28
realism     -1.27
Positive Logits
@         1.84
gets      1.68
#         1.60
#,        1.56
illary    1.56
cott      1.50
itte      1.50
brains    1.43
ubert     1.42
lette     1.42
Activations Density: 4.325%