INDEX
Explanations
text related to leaked information, particularly in a media or news context
Neuron Alignment
Index   Value   % of L₁
1150    +0.10   0.3%
335     +0.08   0.2%
919     +0.07   0.2%
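The "% of L₁" column is not defined on the page. One plausible reading, and it is only an assumption, is that it reports the absolute value of an entry as a share of the L₁ norm of the whole vector. A minimal sketch of that reading follows; the function name `l1_share` and the toy vector are hypothetical and not taken from the dashboard.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a fraction of the vector's L1 norm.

    Assumption: this is what the "% of L1" column measures; the page
    itself does not say so.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; the share is undefined")
    return abs(weights[index]) / l1_norm


# Toy vector (made up for illustration): the L1 norm is 1.0.
toy = [0.5, -0.25, 0.25]
print(f"{l1_share(toy, 0):.1%}")  # prints 50.0%
```

Under this reading, small percentages such as those in the table would mean the vector's mass is spread over many entries rather than concentrated in a few.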
Correlated Neurons
Index   P. Corr.   Cos Sim.
1717    +0.10      0.05
782     +0.08      0.04
1696    +0.07      0.02
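The two columns above, presumably Pearson correlation and cosine similarity, can differ for the same pair of vectors because Pearson correlation subtracts each vector's mean first. The sketch below illustrates only that general relationship; which vectors the dashboard actually compares (activations, weights, or something else) is not stated, and the helper names and sample data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Note: undefined (division by zero) if either vector is all zeros.
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity(
        [x - mean_a for x in a],
        [y - mean_b for y in b],
    )


# Made-up data: b is a shifted by a constant, so the correlation is perfect
# while the cosine similarity is below 1.
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]
print(round(pearson_correlation(a, b), 3))
print(round(cosine_similarity(a, b), 3))
```

A constant offset leaves the correlation unchanged but alters the cosine similarity, which is one reason the two columns need not agree.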
Negative Logits
disagre   -1.71
increa    -1.64
encomp    -1.61
emphat    -1.55
impra     -1.55
reluct    -1.52
depic     -1.50
affor     -1.48
milf      -1.48
shenan    -1.44
Positive Logits
disclosure     0.86
public         0.84
leak           0.78
disclose       0.77
public         0.77
pub            0.76
disclosed      0.75
publication    0.75
publicly       0.74
gepubliceerd   0.72
Activations Density 0.467%