INDEX
Explanations
The neuron activates on occurrences of the word “racist” (and closely related forms like “racism”).
Negative Logits
hdl         -0.07
WAL         -0.07
Holl        -0.06
stranded    -0.06
bol         -0.06
_patches    -0.06
всп         -0.06
ateral      -0.06
getline     -0.06
CLA         -0.06
Positive Logits
racism      0.10
racist      0.10
きな        0.07
reason      0.07
            0.06
़ों          0.06
endorsed    0.06
carpet      0.06
irector     0.06
opening     0.06
Activations Density: 0.005%