INDEX
    Explanations

    model response introduction

    Markers indicating the start of the model/assistant’s response, or of AI-generated content, within a dialogue structure.

    Negative Logits
    reuses         0.37
     Proportion    0.32
    ್ರ             0.30
    ietta          0.30
    िल्ली           0.30
                   0.30
    isements       0.29
    ácil           0.29
    ighi           0.29
    itia           0.29
    Positive Logits
    <h1>           0.44
    आपने           0.42
     fascinating   0.41
    Okay           0.39
    ##             0.39
    Você           0.38
     термин        0.38
    #              0.38
    Sounds         0.37
    你想           0.37
    Activation Density 0.074%

    No Known Activations