INDEX
    Explanations

    This neuron detects profanity-laden calls to ignore or break the rules (e.g., “bullshit”- or “fuckin’”-style exhortations to flout the policy).

    Negative Logits
    Token           Logit
    "rowning"       -0.07
    " glance"       -0.06
    " IRS"          -0.06
    " things"       -0.06
    "-tracking"     -0.06
    " Utility"      -0.06
    " surfaces"     -0.06
    " pools"        -0.06
    " worsening"    -0.06
    " turmoil"      -0.06
    Positive Logits
    Token               Logit
    " dbName"           0.08
    "ortho"             0.07
    "agra"              0.07
    ".office"           0.07
    " Waiting"          0.07
    "cycl"              0.06
    "/Common"           0.06
    "ือถ"               0.06
    ".wikipedia"        0.06
    " EntityManager"    0.06
    Activation Density: 0.002%

    No Known Activations