INDEX
    Explanations

    acknowledging limitations or inability

    role markers that denote the start of the assistant/model’s turn in structured chat transcripts.

    New Auto-Interp
    Negative Logits
    ście
    0.40
    Salt
    0.39
    0.37
    ätte
    0.37
    驿
    0.36
     [&
    0.36
     Happens
    0.35
    0.35
    muş
    0.35
    0.34
    POSITIVE LOGITS
     regrett
    0.50
     regret
    0.45
     unfortunately
    0.45
     sorry
    0.43
     عدم
    0.43
     regretted
    0.42
     unable
    0.41
     inability
    0.41
     unavailable
    0.41
     scandalous
    0.40
    Act Density 0.031%

    No Known Activations