3 points by singhsanjay12 7 hours ago | 1 comment
  • singhsanjay12 7 hours ago
    DNS keeps showing up in outage postmortems, but what's often missing is discussion about recovery, not just prevention.

    In this post, I break down common DNS failure patterns (TTL propagation, resolver overload, control plane dependency loops) and why recovery can deadlock when your tooling itself depends on DNS.

    I'd love to hear how others design around this:

    Do you use DNS-independent fallbacks?

    Static seed lists?

    Separate control plane resolution?

    Aggressive caching vs short TTLs?

    Curious what patterns have worked (or failed) for folks in real systems.

    • Bender 7 hours ago
      For me, and also at the place I retired from, the optimal solution was an instance of Unbound [1] on every node: keeping a local cache, retrying edge resolvers intelligently, preferring the fastest-responding edge resolvers, capping min-ttl for both resource records and infrastructure, pre-caching, etc. I've done that at home, and when others talk about a DNS outage I have to go out of my way to even see it, usually by forcing a flush of the cache.

      Most Linux distributions have a build of Unbound. I point edge DNS recursive resolvers to the root servers rather than leaking internal systems' requests to Cloudflare or Google. Unbound can also be configured not to forward internal names, or to point requests for internal names to specific upstream servers.

      [1] - https://nlnetlabs.nl/projects/unbound/about/
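      A minimal unbound.conf sketch of that kind of setup (the directive names are real Unbound options; the internal zone name and address are hypothetical):

      ```
      server:
          # Keep answers cached locally; put a floor under very short TTLs.
          cache-min-ttl: 60
          cache-max-ttl: 86400
          # Refresh popular cache entries before they expire.
          prefetch: yes
          # No forward-zone "." here: resolve from the roots instead of
          # forwarding everything to a public resolver.

      forward-zone:
          # Hypothetical internal zone: send these queries to an internal
          # server instead of the public DNS tree.
          name: "corp.internal"
          forward-addr: 10.0.0.53
      ```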

      • singhsanjay12 7 hours ago
        Nice. Running Unbound locally with intelligent upstream selection and caching definitely reduces blast radius from edge resolver outages.

        I haven't tried Unbound, but I'm curious: how do you handle recovery behavior when the failure isn't just recursive resolver unavailability, but scenarios like stale IPs after control plane failover, long-lived gRPC connections that never re-resolve, or bootstrap loops where the system that needs to reconfigure DNS itself depends on DNS?

        In my experience, local recursive resolvers solve availability pretty well, but recovery semantics still depend heavily on client behavior and connection lifecycle management.

        Do you rely on aggressive re-resolution policies at the application layer? Or force connection churn after TTL expiry?

        Would love to understand how you think about resolver-level resilience vs application-level recovery.

        • Bender 6 hours ago
          > stale IPs after control plane failover

          We did not have to do this, but in that scenario I would have automation reach out to Unbound and drop the cache for that particular zone or sub-domain. A script could then force fetching the new records for any given zone to rebuild the cache.
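          A sketch of that automation, assuming unbound-control is enabled (control-enable: yes in the remote-control section of unbound.conf); the zone name is hypothetical, and DRY_RUN=1 prints the commands instead of running them:

          ```shell
          # Flush a stale zone from the local Unbound cache after a
          # control-plane failover, then re-fetch to rebuild the cache.
          run() {
            # With DRY_RUN set, echo the command instead of executing it.
            if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi
          }

          flush_and_warm() {
            zone="$1"
            run unbound-control flush_zone "$zone"  # drop cached records under the zone
            run unbound-control lookup "$zone"      # re-resolve to warm the cache
          }

          # Example, e.g. from a failover hook:
          #   DRY_RUN=1 flush_and_warm db.corp.internal
          ```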

          > Or force connection churn after TTL expiry?

          The TTL can be kept low, and Unbound can be told to serve the last known IP after expiry, accepting that this breaks an RFC and that apps may hold onto the wrong IP for too long before Unbound requests the record from upstream again and gets the new IP. There is no one right answer. Whoever is the architect for the environment in question has to decide which methods they believe will be more resilient, and then test failure conditions during chaos testing. Anywhere there is a gap in resilience should be covered by monitoring and automation when the bad behavior cannot be eliminated through app/infra configuration.
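          In Unbound, that "serve the last known answer" behavior maps to the serve-expired options; a sketch (directive names are real, values are illustrative):

          ```
          server:
              # Answer from expired cache entries while re-fetching upstream.
              # This deliberately bends TTL semantics in favor of resilience.
              serve-expired: yes
              # How long past expiry a record may still be served (seconds).
              serve-expired-ttl: 3600
              # Floor very short TTLs so flappy records stay cached.
              cache-min-ttl: 60
          ```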

          > how you think about resolver-level resilience vs application-level recovery

          Well, sadly, the people managing or architecting the infrastructure may not have any input into how the applications manage DNS. Ideally both groups would meet and discuss options if this is a greenfield deployment. If not, the second-best option would be to discuss the platform behavior with a subject matter expert, along with an operations manager who can summarize all the DNS failures, root cause analyses, and restoration methods to determine what behavior should be configured into the stack. Here again there is no one right answer. As a group they will have to decide at which layer DNS retries occur most aggressively and how much input automation will have at the app and infra layers.

          The overall priority should be to ensure that past DNS issues, the known-knowns, are designed out of the system. That leaves only unknown-unknowns to be dealt with in a reactive state, possibly first with automation and then with an operations or SRE team.

          Take a look through the Unbound configuration directives [2] to see some of the options available.

          [2] - https://nlnetlabs.nl/documentation/unbound/unbound.conf/

          • singhsanjay12 6 hours ago
            This matches what I've seen too.

            Resolver-level resilience is often manageable centrally. The harder part is application-level recovery; especially in larger orgs where DNS behavior spans multiple teams.

            Even with low TTLs or cache flush automation, apps may resolve once at startup, hold long-lived gRPC/TCP connections, or, even worse, ignore TTL semantics entirely.

            So infra assumes "DNS healed," but the app never re-resolves.

            • Bender 6 hours ago
              > ignore TTL semantics entirely

              I sense you may have java in your environment and are probably used to

                  export JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Dnetworkaddress.cache.ttl=0"
              
              or something along those lines, including other options. At least I tried to get teams to use those and then rely on Unbound's DNS cache and retry schemes. systemd also has its own resolver cache, which can be disabled and told to use a local instance of Unbound. Windows servers require Group Policy and registry modifications to change their behavior.
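              Pointing systemd-resolved at a local Unbound might look like this in /etc/systemd/resolved.conf (a sketch; assumes Unbound is listening on 127.0.0.1):

              ```
              [Resolve]
              # Hand all queries to the local Unbound instance.
              DNS=127.0.0.1
              # Don't cache in resolved as well; let Unbound own the cache.
              Cache=no
              # Optionally disable the 127.0.0.53 stub listener entirely.
              DNSStubListener=no
              ```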

              One of my pet peeves is when groups do not manage domain/search correctly and do not use FQDNs in application configuration, resulting in 3x or 4x or more the number of DNS requests, which also amplifies all DNS problems/outages. That really grinds my gears.
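              The amplification is easy to see: with a search list in resolv.conf, a bare hostname gets each suffix tried in turn. A toy illustration (the hostname and search domains are hypothetical):

              ```shell
              # Simulate the extra queries a non-FQDN name generates when
              # resolv.conf has a search list: one logical lookup becomes
              # N+1 queries on a miss.
              name="db01"
              search="svc.cluster.local cluster.local corp.example.com"

              for d in $search; do
                echo "query: ${name}.${d}."
              done
              echo "query: ${name}."   # the literal name is tried as well
              ```

              Configuring the app with the FQDN (ideally with a trailing dot) collapses that fan-out to a single query, which is the point about search-domain hygiene above.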

              And of course, if the Linux system uses glibc, editing /etc/gai.conf to prefer IPv4 or IPv6, depending on what is primarily used inside the data center, makes a big difference.
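              For an IPv4-first environment, the well-known /etc/gai.conf tweak is the precedence line for IPv4-mapped addresses (a sketch; one caveat is real and worth noting in the comment):

              ```
              # Prefer IPv4-mapped addresses in getaddrinfo() result sorting.
              # Caveat: once gai.conf contains any precedence line, glibc
              # drops its built-in defaults, so in production list the full
              # default table alongside this override.
              precedence ::ffff:0:0/96  100
              ```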