INDEX
Explanations
statements expressing extreme prejudice or discrimination
Neuron Alignment
Index    Value    % of L₁
1510     +0.15    0.4%
1343     +0.11    0.3%
998      +0.10    0.3%
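The source does not define the "% of L₁" column. One plausible reading is that it reports each neuron's share of the L1 norm of the feature's direction: the absolute value of that neuron's entry divided by the sum of absolute values over all entries. The sketch below illustrates that reading only. The function name l1_shares and the toy vector are hypothetical and are not taken from the dashboard.

```python
def l1_shares(weights):
    """Return each entry's share of the vector's L1 norm, in percent.

    Assumption: "% of L1" means 100 * |w_i| / sum_j |w_j|.
    The dashboard itself does not state this definition.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        # An all-zero vector has no meaningful shares.
        return [0.0 for _ in weights]
    return [100.0 * abs(w) / l1_norm for w in weights]


# Hypothetical toy vector, not the real feature direction.
toy_weights = [0.5, -0.25, 0.25]
shares = l1_shares(toy_weights)
```

Under this reading, small percentages such as those in the table would indicate that the feature's weight is spread across many neurons rather than concentrated in a few.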
Correlated Neurons
Index    P. Corr.    Cos Sim.
1510     +0.15       0.05
1783     +0.11       0.05
1842     +0.10       0.04
Negative Logits
reluct      -2.16
encomp      -2.16
increa      -2.13
guarante    -2.06
fuf         -2.02
volunte     -2.01
inev        -2.00
embra       -1.99
depic       -1.97
emphat      -1.90
Positive Logits
etc     0.99
…       0.96
...     0.95
.       0.88
!       0.86
;       0.85
,       0.85
....    0.84
too     0.84
?       0.82
Activations Density 0.391%
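
Activation density is commonly read as the fraction of sampled tokens on which the feature fires at all. The sketch below assumes that definition, with "fires" meaning an activation strictly above a threshold of zero. The source does not say how the 0.391% figure was computed, and the function name activation_density and the toy data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of positions whose activation exceeds `threshold`.

    Assumption: density = (# tokens with activation > threshold) / (# tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations, not data from the dashboard.
toy_activations = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.2, 0.0]
density = activation_density(toy_activations)
```

Under this reading, a density of 0.391% would mean the feature is active on roughly 4 of every 1,000 sampled tokens.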