INDEX
    Explanations

    references to AI assistants and large language models, especially self-referential descriptions of the model, tools, and benchmarks (often with dates or platform names)

    Negative Logits
    Token            Logit
    "мето"           0.52
    "сле"            0.48
    " needlessly"    0.47
    "counted"        0.46
    " традиции"      0.45
    "народ"          0.45
    "ENSOR"          0.45
    "कांची"          0.45
    "из"             0.44
    "ದ್ದರಿಂದ"        0.43
    Positive Logits
    Token               Logit
    " AI"               0.96
    " ChatGPT"          0.94
    " OpenAI"           0.92
    " chatbot"          0.91
    " GPT"              0.84
    " conversational"   0.80
    " chatbots"         0.76
    " openai"           0.76
    "openai"            0.73
    "chatbot"           0.71
    Act Density 1.397%
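
For readers unfamiliar with the Act Density figure, here is a minimal sketch of how such a value is commonly computed: the fraction of token positions where the feature's activation exceeds a threshold. The function name `activation_density`, the zero threshold, and the sample counts are all assumptions made for illustration; the page does not state how its own figure was derived.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a position counts as active when its activation is strictly
    greater than `threshold` (0.0 here). The dashboard's own rule is not stated.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 1,397 of them active.
sample = [0.0] * (100_000 - 1_397) + [1.0] * 1_397
density = activation_density(sample)
print(f"Act Density {density:.3%}")
```

The sample counts were picked only so the formatted result has the same shape as the figure above; they are not taken from the feature's real activation data.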

    No Known Activations