INDEX
    Explanations

    The neuron detects words related to content‐use restrictions and copyright disclaimers (e.g., “published,” “broadcast,” “rewritten,” “redistributed”).

    Negative Logits
    " Setup"        -0.07
    ".Framework"    -0.07
    "OCK"           -0.07
    "abd"           -0.07
    ".pth"          -0.06
    "そんな"        -0.06
    "ASCADE"        -0.06
    " safer"        -0.06
    " heal"         -0.06
    " sopr"         -0.06
    Positive Logits
    " العالم"       0.07
    " dish"         0.06
    " řekl"         0.06
    "omet"          0.06
    "...,"          0.06
    " přih"         0.06
    "vais"          0.06
    " арти"         0.06
    "terraform"     0.06
    " arkadaş"      0.06
    Activation Density: 0.001%

    No Known Activations