INDEX
Explanations
terms related to sexual orientation, sexual harassment, sexual violence, and discrimination
Neuron Alignment
Index   Value   % of L₁
50      +0.15   0.8%
1218    +0.12   0.7%
58      +0.11   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1218    +0.15      0.04
809     +0.12      0.03
58      +0.11      0.03
Negative Logits
<bos>            -3.20
ⓧ                -0.85
/**              -0.84
/*---            -0.80
/*++             -0.74
<?               -0.74
do               -0.72
raise            -0.69
addCriterion     -0.67
SourceChecksum   -0.67
Positive Logits
ftu         1.82
sovere      1.78
stockholm   1.75
Juf         1.75
fta         1.72
Augu        1.70
thut        1.66
Intere      1.62
disagre     1.60
increa      1.60
Activations Density 0.091%