INDEX
    Explanations

    negative content

    instances of a highly offensive racial slur (the n-word) and similar hateful/derogatory language.

    Negative Logits
    回复    -0.07
    _picture    -0.07
    Lie    -0.07
    שמים    -0.07
    Authorized    -0.06
    -ul    -0.06
    .Information    -0.06
    主力军    -0.06
    -0.06
    _shot    -0.06
    Positive Logits
     bindings    0.07
    0.07
    unsafe    0.06
     hấp    0.06
    .Include    0.06
    param    0.06
     prést    0.06
     ark    0.06
     ASP    0.06
    0.06
    Act Density 0.715%

    No Known Activations