Explanations
discriminatory language related to sexual orientation, gender identity, and civil rights
Neuron Alignment
Index   Value   % of L₁
198     +0.09   0.3%
1842    +0.08   0.2%
1571    +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1571    +0.09      0.03
1818    +0.08      0.04
691     +0.08      0.01
Negative Logits
fays    -0.72
feen    -0.70
endom   -0.70
Juf     -0.67
fign    -0.66
Pfal    -0.63
Dés     -0.63
fua     -0.62
fince   -0.62
sonne   -0.61
Positive Logits
<bos>          0.64
SneakyThrows   0.62
nationality    0.56
niająca        0.53
üedad          0.52
KELEY          0.49
Fitment        0.48
agences        0.48
pañas          0.48
Walkover       0.47
Activations Density 0.291%