INDEX
    Explanations

    instances of the word "refusal."

    Negative Logits
    Token         Logit
    "ets"         -0.70
    " beaut"      -0.69
    "paces"       -0.68
    "rients"      -0.67
    "»Ĵ"          -0.67
    " safely"     -0.66
    " tuned"      -0.64
    " located"    -0.64
    "pixel"       -0.64
    "Featured"    -0.64
    Positive Logits
    Token               Logit
    " refusal"          3.29
    " unwillingness"    2.38
    " reluctance"       2.21
    " insistence"       2.19
    " inability"        2.14
    " rejection"        1.97
    " willingness"      1.93
    " failure"          1.78
    " denial"           1.76
    " dismissal"        1.64
    Activation Density: 0.033%

    No Known Activations