INDEX
    Explanations

    The beginning of the assistant's reply in a dialogue (the assistant turn marker or the first token of the model's message).

    Negative Logits
    " regressions"       0.47
    " opérations"        0.46
    " വിൽപ്പന"           0.45
    " variété"           0.45
    " auteurs"           0.45
    " radiographs"       0.45
    "ÜR"                 0.44
    (token not shown)    0.44
    " ফ্যাস"              0.43
    " collaborateurs"    0.42
    Positive Logits
    " sorry"    0.69
    "sorry"     0.69
    " plz"      0.54
    "Sorry"     0.49
    " Sorry"    0.49
    "hehe"      0.49
    " pls"      0.47
    "they"      0.47
    "answer"    0.46
    "you"       0.46
    Act Density 0.007%

    No Known Activations