INDEX
Explanations
negations and refusals in text
Neuron Alignment
Index   Value   % of L₁
1328    +0.12   0.5%
950     +0.11   0.4%
506     +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
208     +0.12      0.07
950     +0.11      0.07
438     +0.11      0.06
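The two similarity columns in the Correlated Neurons table can be read with a small sketch. It assumes "P. Corr." denotes Pearson correlation and "Cos Sim." cosine similarity in their standard definitions. The page does not state which pair of vectors is compared, so `feature_acts` and `neuron_acts` below are hypothetical illustrative values, not data from this dashboard.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical values, for illustration only (assumption, not dashboard data).
feature_acts = [0.0, 0.0, 1.0, 0.0, 2.0, 0.0]
neuron_acts = [0.5, 0.4, 0.9, 0.5, 1.3, 0.4]

print("Pearson:", round(pearson_correlation(feature_acts, neuron_acts), 3))
print("Cosine: ", round(cosine_similarity(feature_acts, neuron_acts), 3))
```

Because Pearson correlation subtracts each vector's mean first, the two measures can differ noticeably for the same pair, which may be why the table reports both.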
Negative Logits
lele      -0.80
pommes    -0.74
Nguy      -0.72
magazin   -0.70
rong      -0.69
vian      -0.67
pama      -0.66
pipa      -0.66
adal      -0.65
Chinois   -0.65
Positive Logits
shenan       0.70
Fuckin       0.67
Bullshit     0.66
necessarily  0.66
philanth     0.66
FTFY         0.63
Cringe       0.62
Ehh          0.62
unspeak      0.61
desertcart   0.61
Activations Density: 0.182%