INDEX
    Explanations

    rules and restrictions

    The neuron fires on tokens in content-policy refusals, such as apology statements in the style of “I’m sorry” or “I cannot fulfill this request”.

    Negative Logits
    Token         Logit
    " layui"      -0.07
    " ration"     -0.07
    " अध"         -0.07
    "Curso"       -0.07
    " lần"        -0.07
    "uncture"     -0.07
    "/gr"         -0.07
    "_prime"      -0.06
    " θέση"       -0.06
    " مبت"        -0.06
    Positive Logits
    Token         Logit
    "ank"         0.07
    " massa"      0.06
    " selfie"     0.06
    "UMAN"        0.06
    "myModal"     0.06
    "mor"         0.06
    " Pussy"      0.06
    "arlar"       0.05
    "dataTable"   0.05
    " mean"       0.05
    Activation Density: 0.003%

    No Known Activations