Explanations
The neuron flags words appearing in legal non-discrimination statements, especially terms naming protected categories (e.g. sex, race, disability).
Negative Logits
shl        -0.07
-          -0.07
'&&        -0.07
srd        -0.06
DAO        -0.06
acht       -0.06
RAL        -0.06
#####↵     -0.06
Between    -0.06
_Al        -0.06
Positive Logits
alleging   0.06
_pedido    0.06
وئ         0.06
.control   0.06
Ph         0.06
توص        0.06
Ћ          0.06
ط          0.06
Boundary   0.06
-status    0.06
Activations Density 0.003%
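
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of sampled tokens on which the neuron's activation is above zero; the page itself does not state its definition or sample size. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the neuron fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the neuron fires on 3 of 100,000 tokens.
sample = [0.0] * 99_997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.003% corresponds to roughly 3 active tokens per 100,000, which fits a neuron tied to a narrow context such as non-discrimination statements.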