Maybe this weekend I'll finally get the energy up to just do it.
> such that all online caches get updated
There's no such thing. Apart from millions of dedicated caching servers, each end device has its own cache. You can't invalidate DNS entries at that scope.
It's "common" to lower a TTL in preparation for a change to an existing RR, but you need to make sure you lower it at least as long as the current TTL prior to the change. Keeping the TTL low after the change isn't beneficial unless you're planning for the possibility of reverting the change.
A low TTL on a new record will not speed propagation. Resolvers either have the new record cached or they don't. If it's cached, the TTL doesn't matter because the resolver already has the record (it has propagated). If it isn't cached, the resolver doesn't know the TTL yet, so it doesn't matter whether it's 1 second or 1 month.
And a similar version of the same blog post appeared on a personal blog in 2019: https://news.ycombinator.com/item?id=21436448 (thanks to ChrisArchitect for noting this in the only comment on a 2024 repost).
Of course, as internet speeds increase and resources become cheaper to abuse, people lose sight of the downstream impacts of impatience and poor planning.
Failover is different and more of a concern, especially if the client doesn't respect multiple returned IPs.
And then if you're dealing with browsers, they're not the best at trying everything, or they may wait a long time before trying another host if the first is non-responsive. For browsers and rotations that really do change, I like a 60-second TTL. If the rotation is stable most of the time, 15 minutes, cranked down before intentional changes.
If you've got a smart client that will fetch all the answers and try them sensibly, then 5-60 minutes seems reasonable, depending on how often you make big changes.
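For what "try them sensibly" looks like, here's a minimal sketch in plain stdlib Python; the host and port are hypothetical placeholders, not anything from the thread:

```python
import socket

def connect_any(host: str, port: int, timeout: float) -> socket.socket:
    """Try every resolved address in turn; return the first that connects."""
    last_err = None
    for family, socktype, proto, _canonname, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            if sock is not None:
                sock.close()
            last_err = err  # note the failure, move on to the next address
    raise last_err or OSError(f"no addresses for {host!r}")

# Hypothetical usage: a dead first IP costs one short timeout, not the
# whole connection attempt.
conn = connect_any("service.example.com", 443, timeout=3.0)
```

The stdlib's `socket.create_connection` already implements essentially this loop; the point is just that the client falls through unresponsive IPs quickly instead of hanging on the first one.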
All that said, some caches will keep your records basically forever, and there's not much you can do about that. Just gotta live with it.
And a BGP failure is a good example too. It doesn't matter how resilient the failover mechanisms for one IP are if the routing tables are wrong.
Agreed about some providers enforcing a larger minimum TTL, though. DNS propagation is wildly inconsistent.
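You can see the inconsistency directly by asking a few public resolvers for the same record and comparing how much TTL each cache has left (again a sketch assuming dnspython; the domain is a placeholder):

```python
# Compare the remaining TTL for one record across a few public resolvers.
# Each resolver counts its cached TTL down independently.
import dns.exception
import dns.resolver

RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}
DOMAIN = "example.com"  # hypothetical record to inspect

for name, ip in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    try:
        answer = resolver.resolve(DOMAIN, "A")
        # The reported TTL is the time left in *that* resolver's cache.
        print(f"{name}: {answer.rrset.ttl}s remaining")
    except dns.exception.DNSException as err:
        print(f"{name}: lookup failed ({err})")
```

The big public resolvers are also anycast with many independent cache nodes behind one IP, so even repeated queries to the same address can return different countdowns.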
Relatively simple inside a network range you control, but I have no idea how that works across different networks in geographically redundant setups.
Seems like you'd be trying to work against the basic design principles of Internet routing at that point.