INDEX
Explanations
references to AI assistants and large language models, especially self-referential descriptions of the model, tools, and benchmarks (often with dates or platform names)
Negative Logits
мето  0.52
сле  0.48
needlessly  0.47
counted  0.46
традиции  0.45
народ  0.45
ENSOR  0.45
कांची  0.45
из  0.44
ದ್ದರಿಂದ  0.43
Positive Logits
AI  0.96
ChatGPT  0.94
OpenAI  0.92
chatbot  0.91
GPT  0.84
conversational  0.80
chatbots  0.76
openai  0.76
openai  0.73
chatbot  0.71
Activations Density 1.397%