INDEX
    Explanations

    This neuron fires on the word “dirty” when a user asks the model to “talk dirty.”

    Negative Logits
    "/scripts"     -0.06
    " Meadows"     -0.06
    ".generic"     -0.06
    " mechanic"    -0.06
    ":::/"         -0.06
    " nach"        -0.06
    ".maximum"     -0.06
    " Wheeler"     -0.06
    " zengin"      -0.06
                   -0.06
    Positive Logits
    " Lease"       0.07
    "важа"         0.06
    " Dro"         0.06
    " SNAP"        0.06
    "PROJECT"      0.06
    "ambil"        0.06
    " offsets"     0.06
    " Case"        0.06
    " aşam"        0.06
    " FLOAT"       0.06
    Act Density: 0.027%

    No Known Activations