INDEX
Explanations
The neuron activates on single-word race labels (like “black” or “white”), detecting mentions of a person’s race.
Negative Logits
shiny         -0.07
refr          -0.07
修改          -0.07
exponential   -0.07
拟            -0.07
пл            -0.06
reconnect     -0.06
ragazze       -0.06
َع            -0.06
�             -0.06
Positive Logits
coupon        0.07
WSC           0.06
locker        0.06
.stat         0.06
tag           0.06
=self         0.06
dım           0.06
timeline      0.06
-player       0.06
полит         0.06
Activations Density 0.024%