    Explanations

    mentions of the assistant role/label, especially in chat headers or references to the assistant in text.

    Negative Logits
     acids        -0.07
     epidemi      -0.07
    ूक            -0.07
     damned       -0.07
     Their        -0.07
    -Regular      -0.06
     gravel       -0.06
     mitigation   -0.06
    _ll           -0.06
    $res          -0.06
    Positive Logits
    _INLINE       0.07
    ัพย           0.06
    παίδ          0.06
    roat          0.06
    article       0.06
    ้ม            0.06
     인터         0.06
    _guard        0.05
    _BASIC        0.05
    :numel        0.05
    Activation Density: 0.028%

    No Known Activations