INDEX
    Explanations

    refusal, denied, ignored

    Negative Logits
        Faster          0.42
        Context         0.41
        Sexy            0.40
        Spa             0.38
        Optimal         0.38
        важа            0.38
        Falling         0.38
        Bewer           0.38
        Expectations    0.37
                        0.37
    Positive Logits
        আশ্বাস          0.81
        refusal         0.70
        refused         0.69
        refus           0.68
        refuses         0.66
        refusing        0.66
        shrugged        0.66
        told            0.64
        promised        0.64
        refuse          0.63
    Activation Density: 0.007%

    No Known Activations