INDEX
Explanations
language model identity
Self-referential AI meta-discussion: passages where the assistant describes its identity as a language model, along with its capabilities, limitations, safety policies, training, and operational context.
Negative Logits
foolproof     0.80
ampionship    0.69
Tort          0.66
fateful       0.65
Chọn          0.63
Locks         0.63
Arrow         0.63
Trap          0.62
发生在        0.62
“             0.62
Positive Logits
chatbot       0.90
agréable      0.82
informatique  0.81
openai        0.80
язы           0.80
OpenAI        0.80
logiciels     0.79
numérique     0.78
jazy          0.78
chatbots      0.78
Activations Density 0.115%