This post isn’t about models or prompts. It’s about the things that kept breaking once AI moved off the happy path: async jobs, retries, silent failures, provider outages, cost blowups, and debugging without visibility.
I wrote this mostly as a way to document the mistakes I made and what I wish I had known earlier. Happy to answer questions or dig deeper into any of the failure modes.