    Explanations

    censored profanity

    This neuron detects profanity or expletive fragments (e.g., censored swear-word symbols).

    Negative Logits
    Token           Logit
    abant           -0.06
    ("              -0.06
    783             -0.06
    Noise           -0.06
     baths          -0.06
    .untracked      -0.06
    ces             -0.06
     JK             -0.06
     Lincoln        -0.06
     ÜNİVERS        -0.06

    Positive Logits
    Token           Logit
    picked          0.07
    $time           0.06
     tháng          0.06
    Changed         0.06
    .ViewModels     0.06
    ومات            0.06
     Wax            0.06
     прав           0.06
    _sd             0.06
    -degree         0.06
    Activation Density: 0.005%

    No Known Activations