Explanations
moderators
The neuron fires on content-moderation warning text, particularly on fragments of the word “moderation” in the “YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES” message.
Negative Logits
uste        -0.07
xaf         -0.06
ến          -0.06
らし        -0.06
ultimately  -0.06
.shapes     -0.06
Leah        -0.06
らい        -0.06
trabaj      -0.06
meyi        -0.06
Positive Logits
इ           0.07
($(         0.07
sectarian   0.06
,'          0.06
witnessed   0.06
vowed       0.06
demo        0.06
Fucking     0.06
case        0.06
.ImageIcon  0.06
Activations Density: 0.001%