INDEX
Explanations
Instances of the assistant's standard self-referential safety/disclaimer phrasing (e.g., "I'm sorry, but as an AI language model...").
Negative Logits
(job       -0.06
パ         -0.06
(state     -0.06
tạm       -0.06
unite      -0.06
(sample    -0.06
P          -0.06
/class     -0.06
GFP        -0.06
Рус        -0.06
Positive Logits
AI         0.18
AI         0.11
_AI        0.08
Treatment  0.08
Ai         0.08
_build     0.07
ertainty   0.07
.AI        0.07
Ethernet   0.07
.ai        0.07
Activations Density 0.026%