Explanations
mentions of corruption and unethical behavior, especially in public and political contexts
Neuron Alignment
Index   Value   % of L₁
889     +0.14   0.5%
1233    +0.11   0.4%
1026    +0.10   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
889     +0.14      0.02
1233    +0.11      0.02
976     +0.10      0.02
Negative Logits
volunte     -0.98
guarante    -0.94
affor       -0.92
impra       -0.92
increa      -0.91
fortn       -0.90
effe        -0.89
reluct      -0.87
tanga       -0.86
sovere      -0.86
Positive Logits
corruption  1.22
corruption  1.12
corrupt     1.06
Corruption  1.04
Corruption  1.04
corrupted   0.83
Cor         0.82
Cor         0.73
COR         0.72
corrup      0.70
Activations Density: 0.060%