INDEX
Explanations
text related to apologies and blame-shifting
Neuron Alignment
Index    Value    % of L₁
1317     +0.07    0.2%
925      +0.07    0.2%
1938     +0.07    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1919     +0.07       0.05
714      +0.07       0.04
1601     +0.07       0.04
Negative Logits
unwarran       -1.03
swarovski      -1.00
unlaw          -0.95
disagre        -0.95
repug          -0.93
embodi         -0.90
oleo           -0.88
impractica     -0.86
liberality     -0.85
excru          -0.84
Positive Logits
mistake        0.81
mistakes       0.78
wrong          0.69
hindsight      0.64
apologize      0.63
erred          0.62
regrets        0.60
regret         0.59
apologise      0.59
missed         0.58
Activations Density: 0.451%