    Explanations

    language related to denial and refusal

    Negative Logits
    ึ้น  -0.58
    SpringRunner  -0.57
     AssemblyProduct  -0.56
    artament  -0.52
     chng  -0.52
     Mog  -0.51
     Lleva  -0.50
     Kasper  -0.50
     zure  -0.50
     tra  -0.50
    Positive Logits
     refusal  1.56
     reject  1.55
     rejection  1.53
     refuse  1.48
     Refuse  1.48
     rejects  1.47
     denied  1.46
     rejected  1.46
     rejecting  1.45
     Reject  1.45
    Activation Density: 0.271%

    No Known Activations