Explanations
This neuron activates on the formal definitional and instructional language used to specify sexual-content policy (e.g., words like “Content,” “meant,” “arouse,” “excitement,” “such,” “description,” “excluding”).
Negative Logits
Abyss       -0.07
vacant      -0.07
wreckage    -0.06
bliss       -0.06
expected    -0.06
ircuit      -0.06
gone        -0.06
obs         -0.06
_build      -0.06
Injection   -0.06
Positive Logits
ANT         0.07
%=          0.06
kuş         0.06
เค          0.06
زیبا        0.06
工          0.06
кора        0.06
="")↵       0.06
United      0.06
başlayan    0.06
Activations Density: 0.012%