    Explanations

    instances of refusal or resistance to actions or decisions

    Negative Logits
     Zwe             -0.17
    aley             -0.16
    afe              -0.16
    å¥ĩ              -0.16
    _categorical     -0.14
    ikel             -0.14
    ijo              -0.14
    üt               -0.14
    kn               -0.14
    inos             -0.14
    Positive Logits
     refusal         0.19
     refuses         0.18
     refused         0.17
     refuse          0.17
     refusing        0.16
     insistence      0.15
     Marr            0.15
     ******************************************************************************↵  0.15
    amina            0.15
    Ĵáŀ              0.15
    Act Density 0.197%

    No Known Activations