INDEX
Explanations
mentions of LGBTQ+ related terms and discrimination issues
Neuron Alignment
Index    Value    % of L₁
468      +0.11    0.3%
513      +0.10    0.3%
1843     +0.09    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
919      +0.11       0.02
513      +0.10       0.03
1241     +0.09       0.04
Negative Logits
corrom              -0.54
huma                -0.49
dépens              -0.48
engend              -0.48
peup                -0.47
CodedInputStream    -0.46
vulga               -0.45
dépass              -0.45
répand              -0.45
palab               -0.44
Positive Logits
trecut             0.58
dă                 0.54
disambiguazione    0.52
Audiodateien       0.51
blest              0.51
SBATCH             0.51
üedad              0.50
împre              0.49
jectures           0.49
disqual            0.49
Activations Density: 0.394%