INDEX
    Explanations

    negations and conditional phrases indicating refusal or limitations

    NEGATIVE LOGITS
    "äl"        -0.16
    "ndx"       -0.16
    "reet"      -0.16
    "zdy"       -0.15
    "iên"       -0.15
    "rar"       -0.14
    "ripp"      -0.14
    "iedade"    -0.14
    "lew"       -0.14
    "ntax"      -0.13
    POSITIVE LOGITS
    " be"        0.17
    " diá»ħn"    0.16
    "iece"       0.16
    "åĵ¡"        0.15
    "quet"       0.14
    "åijĺ"        0.14
    " sul"       0.14
    "rut"        0.14
    "-linear"    0.14
    "ogo"        0.14
    Activation Density: 0.074%

    No Known Activations