    Explanations

    explicit user task requests or questions, especially the concrete ask near the end of a user turn.

    Negative Logits
     prů        0.46
     বাহিনী      0.41
     arquivos   0.41
     nouveaux   0.40
     vaisseaux  0.39
    ابط         0.38
    规模         0.38
     rutas      0.38
    વાથી        0.38
     alguns     0.38
    Positive Logits
    ?           0.57
                0.56
    hint        0.55
     Hint       0.54
     Answer     0.54
     आंसर        0.50
    ↵↵          0.49
    ?"          0.47
     используя  0.46
     Ans        0.45
    Act Density 0.304%

    No Known Activations