INDEX
    Explanations

    personal pronouns and affirmations in conversational contexts

    I/you followed by verbs

    tokens that mark the assistant's or model's reply or role label (e.g., "Assistant", "Response", the colon after a role, or other assistant-turn markers)

    Negative Logits
    Token               Logit
    " useAppContext"    -0.42
    " HasFactory"       -0.40
    "ljiv"              -0.40
    (blank)             -0.40
    "脚注の使い方"      -0.40
    "tangentMode"       -0.39
    "الإنجليزية"        -0.38
    " tavo"             -0.38
    " vuestro"          -0.38
    "eterangan"         -0.38
    Positive Logits
    Token               Logit
    " unsafe"           0.46
    "Safe"              0.44
    "Saf"               0.43
    " المعيارى"         0.43
    "Safety"            0.42
    " safer"            0.41
    " SAFE"             0.41
    "aarrggbb"          0.41
    " Safer"            0.40
    "cup"               0.40
    Activation Density: 0.009%

    No Known Activations