Don't you normally have 2 DNS servers listed on any device? So was the second also down? If not, why didn't it fail over to that?
Unfortunately, the configuration mistake that caused this outage disabled Cloudflare's BGP advertisements of both 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
What are you trying to do on your wifi?
If you aren't using their DNS, then your network requests just get dropped (as you're not approved yet). You need their DNS to learn how to reach their captive portal host so they can whitelist your MAC address.
- frequently captive portals only permit access for 1-2 hours. Your internet gets cut off, then you have to realize it's not a temporary issue but a portal issue, then you close the VPN, try to find the captive portal, and re-auth.
- latency is too high for my home VPN when I travel in Asia
Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server from an IP rather than from a hostname that itself has to be resolved by a DNS server :/
Normal DNS can normally be changed in your connection settings for a given connection on most flavours of Android.
Yes, sorry, I did not mention it.
So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
Not true. If the (DoH) host has multiple A/AAAA records (multiple IPs), any decent DoH client would retry its requests over multiple or all of those IPs.
DoH hosts can resolve to multiple IPs (and even different IPs for different clients)?
Also see TFA
It's worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address. DoH remained available and traffic was mostly unaffected as cloudflare-dns.com uses a different set of IP addresses.
Yes, but not from a different organization. That was GP's point with
> So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
A cross-organizational fallback is not possible with DoH in many clients, but it is with plain old DNS.
> It's worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com
Yes, but that has nothing to do with failovers to an infrastructurally/operationally separate secondary server.
That's client implementation lacking, not some issue inherent to DoH?
The DoH client is configured with a URI Template, which describes how to construct the URL to use for resolution. Configuration, discovery, and updating of the URI Template is done out of band from this protocol.
Note that configuration might be manual (such as a user typing URI Templates in a user interface for "options") or automatic (such as URI Templates being supplied in responses from DHCP or similar protocols). DoH servers MAY support more than one URI Template. This allows the different endpoints to have different properties, such as different authentication requirements or service-level guarantees.
https://datatracker.ietf.org/doc/html/rfc8484#section-3
But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.
Which means that you’d be on cloudflare half the time and on google half the time which may not be what you wanted.
I haven't been able to find any recourse. The malware was online for a few hours but it has been weeks and there seems to be no way to clear my name. Someone on github (the website is open source) suggested that it's probably because they didn't know of the website, like everyone heard of wetransfer and github and so they don't get the whole domain blocked for malicious user content. I can't find any other difference, but also no responsible party to ask. The false-positive reporting tool on quad9's website just reloads the page and doesn't do anything
¹ I'm aware DNS can't do this, but with a direct way of contacting a very responsive admin (no captchas or annoying forms, just email), I'd not expect scanners to resort to blocking the domain outright to begin with, at least not after they heard back the first time and the problematic content has been cleared swiftly
Sometimes the upstream blocklist provider will be easy to contact directly as well. Sometimes not so much.
There was no ticket number yet because I was mainly trying to resolve it upstream (whoever made it get into uBlock's default block list, Quad9, and probably other places) and then today when I checked your site specifically, the link in "False Positive? <Please contact us>" (when you do a lookup for a blocked domain) just links back to itself so I couldn't open a case there either. Now that I look at the page again, with the advice in mind from a sibling comment to just email you, I now see that maybe this is supposed to go to the generic contact form and I needn't go through this domain status page. Opening the contact page now, I see that removal from blocklist is a selectable option so I'll use that :)
The ticket number I just submitted is 41905. Not that I'd want you to now apply preferential treatment, I didn't expect my post above to be seen by many people though I very much appreciate that you've reached out here. Makes me think you're actually interested in resolving this type of issue for small website operators, where the complete block without so much as a heads up felt a bit, well, like that might not get me anywhere. If the process just works as it normally should, that's good enough for me! Thanks for encouraging me to actually open a ticket!
You just convinced me to ditch quad9.
I don't see contact info on your profile or website/blog, but I can post here what the outcome is
Edit: I love your blog's theme btw!
The templates you need are:
1.1.1.1: https://cloudflare-dns.com/dns-query
9.9.9.9: https://dns.quad9.net/dns-query
8.8.8.8: https://dns.google/dns-query
See https://learn.microsoft.com/en-us/windows-server/networking/... for info on how to set the templates.
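For illustration, on recent Windows (11 / Server 2022+) the template is set per server with PowerShell; a sketch from memory, so treat the cmdlet and parameter names as assumptions to verify against the linked doc, and "Ethernet" as a placeholder interface name:

# register the DoH template for the resolver IP (hedged sketch)
Add-DnsClientDohServerAddress -ServerAddress 1.1.1.1 -DohTemplate "https://cloudflare-dns.com/dns-query" -AllowFallbackToUdp $false -AutoUpgrade $true
# then point the interface at that IP as usual
Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ServerAddresses 1.1.1.1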
This "URL template" thing seems odd – is Windows doing something like creating a URL out of the DNS IP and a pattern, e.g. 1.1.1.1 + "https://<ip>/foo" would yield https://1.1.1.1/foo?
If so, why not just allow providing an actual URL for each server?
It doesn't say they sell traffic logs outright, but they do send telemetry on blocked domains to the blocklist provider, and provide "a sparse statistical sampling of timestamped DNS responses" to "a very few carefully vetted security researchers". That's not exactly "selling traffic logs", but it is fairly close. Moreover, colloquially speaking, it's not uncommon to claim "Google sells your data", even though they don't provide dumps and only disclose aggregated data.
The part about sharing data with "a very few carefully vetted security researchers" doesn't preclude them from leaking domains. For instance, if the security researcher exports a "SELECT hostname, COUNT(*) ... GROUP BY hostname" query, that would arguably count as "summary form", and would include any secret hostnames.
>https://quad9.net/privacy/policy/#22-data-collected
If you're trying to imply that they can't possibly be leaking hostnames because they don't collect hostnames, that's directly contradicted by the subsequent sections, which specifically mention that they share metrics grouped by hostname. Obviously they'll need to collect hostnames to provide such information.
Right, but the privacy policy also says there's a separate program for "a very few carefully vetted security researchers" where they can get data in "summary form", which can leak domain name in the manner I described in my previous comment. Maybe they have a great IRB (or similar) that would prevent this from happening, but that's not mentioned in the privacy policy. Therefore it's totally in the realm of possibility that secret domain names could be leaked, no "really having a go with the shoehorn" required.
You should probably be using a trusted TLS certificate for your git hosting. And that means the host name will end up in certificate transparency logs which are even easier to scrape than DNS queries.
> How Quad9 protects your privacy?
> When your devices use Quad9 normally, no data containing your IP address is ever logged in any Quad9 system.
Of course they have some kinds of logs. Aggregating resolved domains without logging client IPs is not what the implication of "Quad9 is reselling the traffic logs" seems to be.
I am curious though, do you have any suggestions for alternative DNS that is better?
The very idea strikes me as irresponsible and misguided.
Thus, if you don't know the host, you will not be able to hit the backend service. But if you do know it, you can start exploiting it, either through lack of auth or by trying to exploit the software itself.
If your device doesn't support proper failover, use a local DNS forwarder on your router or an external one.
In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> dns0.eu (unfiltered version)
I get that in theory blah blah, but we now have choices in who gets to see all of our requests and the ISP will always lose out to the other losers in the list
If you choose a resolver that is very far away, 100 ms longer page loads do add up quickly...
Note how root "." just works and has done for decades - that's proper engineering and actually way more complicated than running 1.1.1.1. What 1.1.1.1 suffers from is anycast and not DNS.
Cloudflare (and Google and co) insist on using one or more "vanity" IP addresses - that is very unfair of me but that is what it is, and to make it work, they have to use anycast.
The real issue is fixing anycast and not DNS.
Anyway, select two+ providers and set them.
Unless you do something fancy with a local caching dns proxy with more than one upstream.
I would count not configuring at least two as 'user error'. Many systems require you to enter a primary and alternate server in order to save a configuration.
Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will be "let's wait and see if it fixes itself".
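For reference, that "fail N times before alerting" behavior is just a couple of directives in a classic Nagios/Icinga 1.x object definition; a sketch, with made-up host/command names and assuming a check_dns command is defined:

define service {
    host_name             resolver-edge-01    ; hypothetical host
    service_description   DNS answers on port 53
    check_command         check_dns           ; assumes this command exists in commands.cfg
    check_interval        1                   ; minutes between normal checks
    retry_interval        1                   ; recheck quickly while the failure is still SOFT
    max_check_attempts    3                   ; only page after 3 consecutive failures
}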
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
Were it a much smaller service than 1.1.1.1, taking longer than a minute to alarm probably wouldn't surprise me, but this is 1.1.1.1; they're dealing with vast amounts of probably fairly consistent traffic.
I don’t want to devolve this to an argument from authority, but - there’s a lot of trade offs to monitoring systems, especially at that scale. Among other things, aggregation takes time at scale, and with enough metrics and numbers coming in, your variance is all over the place. A core fact about distributed systems at this scale is that something is always broken somewhere in the stack - the law of averages demands it, and so if you’re going to do an all-fire-alarm alert any time part of the system isn’t working, you’ve got alarms going off 24/7. Actually detecting that an actual incident is actually happening on a machine of the size and complexity we’re talking about within 5 minutes is absolutely fantastic.
Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems take a while to redeploy in this case, assuming that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you often can differentiate no data and real drops, but the overall point, of "if you page people constantly, people will quit" I think is the important one. If people keep getting paged for too tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
When you are building systems like 1.1.1.1 having an alert rollup of five minutes is not acceptable as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
Now without crying: I saw multiple big companies getting rid of the NOC and replacing it with on-call duties in multiple, focused teams. Instead of 12 people sitting 24/7 in groups of 4 and doing some basic analysis and steps before calling others - you page the correct people in 3-5 minutes, with an exact and specific alert.
Incident resolution times went down greatly (2-10x, depending on the company), people don't have to sit overnight sleeping most of the time, and no pointless actions like a service restart get taken that slow down incident resolution.
And I don't like that some platforms hire 1500 people for a job that could be done by 50-100, but in terms of incident response - if you already have teams with separated responsibilities, then a NOC is "legacy".
(Have worked as SRE at large global platform)
I just mostly over the last few years tune out such responses and try not to engage them. The whole uninformed "Well, if it were me, I would simply not do that" kind of comment style has been pervasive on this site for longer than AI though, IMO.
It took me a very long time to realize that^. I've worked with two NOCs at two huge companies, and I know they still exist as teams at those companies. I'm not an SWE, though. And I'm not certain I'd qualify either company as truly "global" except in the loosest sense - as in, one has "American" in the name of the primary subsidiary.
^ I even regularly have used "the comments were people incorrecting each other about <x>", so I knew subconsciously that HN is just a different subset of general internet comments. The issue comes from this site appearing to be moderated, and the group of people that select for commenting here seem like they would be above average at understanding and backing up claims. The "incorrecting" label comes from n-gate, which hasn't been updated since the early '20s, last I checked.
Step 1: You start out with the founders being on call 24x7x365, or people in the first 10 or 20 hires "carry the pager" on weekends and evenings, and your entire company is doing unpaid rostered on-call.
Step 2: You steal all the underwear.
Step 3: You have follow-the-sun office-hours support staff teams distributed around the globe with sufficient coverage for vacations and unexpected illness or resignations.
<google google google>
"Original air date: December 16, 1998"
Oh, right. Half of you weren't even born... Now I feel ooooooold.
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
Operating at the scale of cloudflare? A lot.
* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error
* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped
* traffic appears to be down 90% but actually the counter rolled over and prometheus handled it wrong
* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag
* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right acls around snmp so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%.... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones.
* traffic is actually down 90% because there's an internet routing issue (not the dns team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for almost a decade.
1: localized issues did exist, but that's really the fault of the internet and they did remain running when google itself suffered severe downtime in various different services.
European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.
I have musknet, though, so I can't edit the DNS providers on the router without buying another router, so cellphones aren't automatically on this plan, nor are VMs and the like.
Having a fully configured spare pi-hole in a box also helps. Another time my pi-hole refused to boot after a power outage.
So I went to Best Buy and bought 3 routers, and set each one up for 1 week. Turns out, you can get publicly routable IPv6 with a third-party router, if the router supports IPv6.
I still see people mentioning OPNsense and pfSense on here from time to time, and I wonder if I got the wrong - maybe outdated - ISO images? I also tried doing it with FreeBSD and Debian and couldn't figure it out, which is a bit depressing for me. I'll try again someday.
You ask the .com servers for domain.com's NS records, and then you ask ns1.domain.com for foo.domain.com. Then you browse to wikipedia.org, and none of those DNS queries go to the same place as the previous site's did.
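You can watch that delegation walk with dig's trace mode (illustrative; example.com here stands in for any zone):

dig +trace foo.example.com
# shows the referrals: root -> .com servers -> example.com's NS -> answer;
# a different domain's walk touches a completely different set of servers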
Cloudflare has a reasonable culture around incident response, but it doesn't incentivize proactive prevention.
From the longer term graphs it looks like volume returned to normal https://imgur.com/a/8a1H8eL
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
It’s corporate newspeak. “legacy” isn’t a clear term, it’s used to abstract and obfuscate.
> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.
I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.
I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.
I do believe they could have elaborated more on the follow-up steps they will take to prevent this from happening again. I don’t think staggered roll-outs are the only answer to this; they’re just a safety net.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Or they have a different definition of impact than I do
Note that this introduces one extra query of overhead per DNS request if the previously cached entry has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.
In theory, this should eliminate that overhead. Your operating system can validate the IP address of the DoH server by using the Subject Alternative Name (SAN) field within the TLS certificate it presents: https://g.co/gemini/share/40af4514cb6e
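You can see those IP SANs for yourself; a quick sketch with the openssl CLI (exact output formatting varies by version):

openssl s_client -connect 1.1.1.1:443 -servername cloudflare-dns.com </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# expected to list DNS:cloudflare-dns.com alongside IP Address entries such as 1.1.1.1 and 1.0.0.1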
Let's Encrypt is trialling IP address HTTPS/TLS certificates right now:
https://letsencrypt.org/2025/07/01/issuing-our-first-ip-addr...
They say:
"In principle, there’s no reason that a certificate couldn’t be issued for an IP address rather than a domain name, and in fact the technical and policy standards for certificates have always allowed this, with a handful of certificate authorities offering this service on a small scale."
DigiCert does. That is where 1.1.1.1 and 9.9.9.9 get their valid certificates from
So certs were often tied to identity, which an IP really isn't, so few providers offered them.
There are two main reasons IP certificates were not widely used in the past:
- Before the SAN extension, there was just the CN, and there's only one CN per certificate. It would generally be a waste to set your only CN to a single IP address (or spend more money on more certs and the infrastructure to maintain them). A domain can resolve to multiple IPs, which can also be changed over time; users usually want to go to e.g. microsoft.com, not whatever IP that currently resolves to. We've had SANs for awhile now, so this limitation is gone.
- Domain validation (serve this random DNS record) involves ordinary forward-lookup records under your domain. Trying to validate IP addresses over DNS would involve adding records to the reverse-lookup in-addr.arpa domain, which varies in difficulty from annoying (you work for a large org that owns its own /8, /16, or /24) to impossible (you lease a small number of unrelated IPs from a bottom-dollar ISP). IP addresses are much more doable now thanks to HTTP validation (serve this random page on port 80), but that was an unnecessary/unsupported modality before.
> Your operating system can validate the IP address of the DoH server by using the Subject Alternative Name (SAN) field within the TLS certificate it presents: https://g.co/gemini/share/40af4514cb6e
How is the IP address of the DoH server obtained?
> network.trr.bootstrapAddress
> (default: none) by setting this field to the IP address of the host name used in "network.trr.uri", you can bypass using the system native resolver for it. Use this to get the IPs of the cloudflare server: https://dns.google/query?name=mozilla.cloudflare-dns.com
> Starting with Firefox 74 setting the bootstrap address is no longer required in mode 3. Firefox will attempt to use regular DNS in order to get the IP address of the trusted resolver. However, if DNS resolution of the resolver domain fails, setting the bootstrap address is again necessary.
TL;DR: DoH was working
all-servers
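# all-servers: query every listed upstream in parallel and return the first reply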
server=8.8.8.8
server=9.9.9.9
server=1.1.1.1
If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.
Using the servers in the above example, and assuming IPv4 + IPv6:
1.1.1.1
2001:4860:4860::8888
9.9.9.9
2606:4700:4700::1111
8.8.8.8
2620:fe::fe
1.0.0.1
2001:4860:4860::8844
149.112.112.112
2606:4700:4700::1001
8.8.4.4
2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all Cloudflare IPs together, then all Google IPs, etc.
Also note that Quad9 filters by default on this IP while the other two do not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-DNSSEC-validating resolvers if you care about that (all of the above are DNSSEC validating).
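A minimal sketch of that interleaving in /etc/systemd/resolved.conf (IPv4 only here; the IPv6 addresses above can be mixed in the same way), followed by a restart of systemd-resolved:

[Resolve]
# consecutive entries belong to different operators, so the first
# fallback attempt already leaves the affected provider
DNS=1.1.1.1 8.8.8.8 9.9.9.9 1.0.0.1 8.8.4.4 149.112.112.112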
I was handling an incident due to this outage. I ended up adding Google DNS resolvers using systemd-resolved, but I didn't think to interleave them!
dnsmasq with a list of smaller trusted DNS providers sounds perfect, as long as it is not considered bad etiquette to spam multiple DNS providers for every resolution?
But where to find a trusted list of privacy focused DNS resolvers. The couple I tried from random internet advice seemed unstable.
If I have issues with cloudflare what do I do?
I believe that they follow their published policies and have reasonable security teams. They're also both popular services, which mitigates many of the other types of DNS tracking possible.
https://developers.google.com/speed/public-dns/privacy https://developers.cloudflare.com/1.1.1.1/privacy/public-dns...
> OpenNIC (also referred to as the OpenNIC Project) is a user owned and controlled top-level Network Information Center offering a non-national alternative to traditional Top-Level Domain (TLD) registries; such as ICANN.
I need to do a write-up one day
server:
logfile: ""
log-queries: no
# adjust as necessary
interface: 127.0.0.1@53
access-control: 127.0.0.0/8 allow
infra-keep-probing: yes
tls-system-cert: yes
forward-zone:
name: "."
forward-tls-upstream: yes
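# addr@853#name means DNS over TLS to addr, authenticating the certificate against name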
forward-addr: 9.9.9.9@853#dns.quad9.net
forward-addr: 193.110.81.9@853#zero.dns0.eu
forward-addr: 149.112.112.112@853#dns.quad9.net
forward-addr: 185.253.5.9@853#zero.dns0.eu
If you want to eschew centralized DNS altogether, if you run a Tor daemon, it has an option to expose a DNS resolver to your network. Multiple resolvers if you want them.
I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
[0] https://man7.org/linux/man-pages/man3/inet_aton.3.html#DESCR...
1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere. Indeed right now 1.1.1.1 from my laptop goes via 141.101.71.63 and 1.0.0.1 via 141.101.71.121, which are both hosts on the same LINX/LON1 peer but presumably from different routers, so there is some resilience there.
Given DNS is about the easiest thing to avoid a single point of failure on I'm not sure why you would put all your eggs in a single company, but that seems to be the modern internet - centralisation over resilience because resilience is somehow deemed to be hard.
I guess. I wouldn't have thought it worthwhile for 4 chars, but yes.
> 1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere.
I thought anycast gave them that on a single IP, though perhaps this is even more resilient?
You can see they are separate routes, say by looking up each IP in Telia's looking glass:
https://lg.telia.net/?type=bgp&router=fre-peer1.se&address=1...
https://lg.telia.net/?type=bgp&router=fre-peer1.se&address=1...
In this case they both are advertised from the same peer above, I suspect they usually are - they certainly come from the same AS, but they don't need to. You could have two peers with cloudflare with different weights for each /24
That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.
And the closest resolving proxy DNS server for most of my machines is listening on their loopback interface. The closest such machine happens to be about 1m away, so is beaten out of first place by centimetres. (-:
It's a shame that Microsoft arbitrarily ties such functionality to the Server flavour of Windows, and does not supply it on the Workstation flavour, but other operating systems are not so artificially limited or helpless; and even novice users on such systems can get a working proxy DNS server out of the box that their sysops don't actually have to touch.
The idea that one has to rely upon an ISP, or even upon CloudFlare and Google and Quad9, for this stuff is a bit of a marketing tale that is put about by these self-same ISPs and CloudFlare and Google and Quad9. Not relying upon them is not actually limited to people who are skilled in system operation, i.e. who they are; but rather merely limited by what people run: black box "smart" tellies and whatnot, and the Workstation flavour of Microsoft Windows. Even for such machines, there's the option of a decent quality router/gateway or simply a small box providing proxy DNS on the LAN.
In my case, said small box is roughly the size of my hand and is smaller than my mass-market SOHO router/gateway. (-:
Changed back to just using big resolvers and all those issues disappeared.
If you run your own recursive DNS server (I keep forgetting to use the right term) on a local network, you can hit the root servers directly, which makes that the most reliable possible DNS resolver. Yes you might get more cache misses initially but I highly doubt you'd notice. (note: querying the root nameservers is bad netiquette; you should always cache queries to them for at least 5 minutes, and always use DNS resolvers to cache locally)
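If you want to try it, a purely recursive unbound instance needs very little configuration; a minimal sketch (unbound ships with built-in root hints, so no forward-zone is required):

server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    # with no forward-zone configured, unbound walks delegations from the
    # root servers itself and caches every answer locally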
I'd argue that accounting for poorly managed ISP resolvers is a critical part of reasoning about reliability.
In terms of my everyday usage, for the past couple of decades, cache miss delays are largely lost in the noise of stupidly huge WWW pages, artificial service greylisting delays, CAPTCHA delays, and so forth.
Especially as the first step in any full cache miss, a back-end query to the root content DNS server, is also just a round-trip over the loopback interface. Indeed, as is also the second step sometimes now, since some TLDs also let one mirror their data. Thank you, Estonia. https://news.ycombinator.com/item?id=44318136
And the gains in other areas are significant. Remember that privacy and security are also things that people want.
Then there's the fact that things like Quad9's/Google's/CloudFlare's anycasting surprisingly often results in hitting multiple independent servers for successive lookups, not yielding the cache gains that a superficial understanding would lead one to expect.
Just for fun, I did Bender's test at https://news.ycombinator.com/item?id=44534938 a couple of days ago, in a loop. I received reset-to-maximum TTLs from multiple successive cache misses, on queries spaced merely 10 seconds apart, from all three of Quad9, Google Public DNS, and CloudFlare 1.1.1.1. With some maths, I could probably make a good estimate as to how many separate anycast caches on those services are answering me from scratch, and not actually providing the cache hits that one would naïvely think would happen.
I added 127.0.0.1 to Bender's list, of course. That had 1 cache miss at the beginning and then hit the cache every single time, just counting down the TTL by 10 seconds each iteration of the loop; although it did decide that 42 days was unreasonably long, and reduced it to a week. (-:
</soapbox>
Judging by Cloudflare's privacy policy, they hold less personally identifiable information than my ISP while offering EDNS and low latencies? Win, win, win.
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.
https://developers.cloudflare.com/1.1.1.1/faq/#does-1111-sen...
I've also changed to 9.9.9.9 and 8.8.8.8 after using 1.1.1.1 for several years because connectivity here is not very good, and being connected to the wrong data center means RTT in excess of 300 ms. Makes the web very sluggish.
Quad9 has a very aggressive blocking policy (my site with user-uploaded content was banned without even reporting the malicious content; if you're a big brand name it seems to be fine to have user-uploaded content though) which this would be a possible workaround for, but it may not take an nxdomain response as a resolver failure
Although, perhaps, having an external VPS with a dns proxy could be a good middle ground?
And it’s not a conspiracy theory - it was very suspicious when we did some testing with a small, aware group. The traffic didn’t look like it was being handled anonymously on Google’s side.
Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.
It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10% refund for the month if you paid for it.
I find it somewhat surprising that none of the multiple engineers who reviewed the original change in June noticed that they had added 1.1.1.0/24 to the list of prefixes that should be rerouted. I wonder what sort of human mistake or malice led to that original error.
Perhaps it would be wise to add some hard-coded special-case mitigations to DLS such that it would not allow 1.1.1.1/32 or 1.0.0.1/32 to be reassigned to a single location.
But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use the same ops/deploy/etc stack, in this one, you probably want an extra couple steps in the way of potentially taking a large public service offline.
Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC
Weird. According to my own telemetry from multiple networks they were unavailable for a lot longer than that.
EDIT: Appears I was wrong, it is failover not round-robin between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
If you have a more advanced local resolver of some sort (systemd for example) you can configure whatever behaviour you want.
This writing is just brilliant. Clear to technical and non-technical readers. Makes the in-progress migration sound way more exciting than it probably is!
> We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.
This is about as good as you can get it from a company as serious and important as Cloudflare. Bravo to the writers and vetters for not watering this down.
Maybe there is a noticeable difference?
I have seen more outage incident reports from Cloudflare than from Google, but this is just a personal anecdote.
Over the last 30 days, 8.8.8.8 has had 99.99% uptime vs 99.09% for 1.1.1.1.
For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are 15.0ms, and 9.9.9.9 is 13.8ms.
All of those servers return over 3-nines of uptime when quantised in the "worst result in a given 1 minute bucket" from my monitoring points, which seem fine to have in your mix of upstream providers. Personally I'd never rely on a single provider. Google gets 4 nines, but that's only over 90 days so I wouldn't draw any long term conclusions.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
I use their DNS over HTTPS and if I hadn't seen the issue being reported here, I wouldn't have caught it at all. However, this—along with a chain of past incidents (including a recent cascading service failure caused by a third-party outage)—led me to reduce my dependencies. I no longer use Cloudflare Tunnels or Cloudflare Access, replacing them with WireGuard and mTLS certificates. I still use their compute and storage, but for personal projects only.
The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Yes.
> Shouldn’t the fix be to ensure that these are served out of completely independent silos [...]?
Yes.
> If so, why were they on the same infrastructure?
Apparently, they weren’t independent enough: a single configuration change inside Cloudflare affected the BGP advertisements for both address ranges at once.
The solution for the end user is, of course, to use 1.1.1.1 and 8.8.8.8 (or any other combination of two different resolvers).
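On a plain libc stub resolver that is just two lines in /etc/resolv.conf; a sketch (the timeout tweak is optional, glibc defaults to a 5 s timeout per server):

# two resolvers run by different operators
nameserver 1.1.1.1
nameserver 8.8.8.8
options timeout:2 attempts:2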
I use Cloudflare at work. Cloudflare has many bugs, and some technical decisions are absurd, such as the Workers cache.delete method, which only clears the cache contents in the data center where the Worker was invoked!!! https://developers.cloudflare.com/workers/runtime-apis/cache...
In my experience, Cloudflare support is not helpful at all, trying to pass the problem onto the user, like "Just avoid holding it in that way. ".
At work, I needed to use Cloudflare. The next job I get, I'll put a limit on my responsibilities: I don't work with Cloudflare.
I will never use Cloudflare at home and I don't recommend it to anyone.
Next week: A new post about how Cloudflare saved the web from a massive DDOS attack.
The Cache API is a standard taken from browsers. In the browser, cache.delete obviously only deletes that browser's cache, not all other browsers in the world. You could certainly argue that a global purge would be more useful in Workers, but it would be inconsistent with the standard API behavior, and also would be extraordinarily expensive. Code designed to use the standard cache API would end up being much more expensive than expected.
With all that said, we (Workers team) do generally feel in retrospect that the Cache API was not a good fit for our platform. We really wanted to follow standards, but this standard in this case is too specific to browsers and as a result does not work well for typical use cases in Cloudflare Workers. We'd like to replace it with something better.
To me, it only makes sense if the put method creates a cache only in the datacenter where the Worker was invoked. Put and delete need to be related, in my opinion.
Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
My criticisms aren't about functionality per se or the developers. I don't doubt the developers' competence, but I feel like there's something wrong with the company culture.
That is, in fact, how it works. cache.put() only writes to the local datacenter's cache. If delete() were global, it would be inconsistent with put().
> Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
Say you read the cache entry but you find, based on its content, that it is no longer valid. You would then want to delete it, to save the cost of reading it again later.
Thanks, I didn't know that (I don't remember reading it in the documentation)
That said, I don't use workers and don't plan to. I personally try to stay away from non cross-platform stuff because I've been burned too heavily with vendor/platform lock-in in the past.
If we changed an API in Workers in a way that broke any Worker in production, we consider that an incident and we will roll it back ASAP. We really try to avoid this but sometimes it's hard for us to tell. Please feel free to contact us if this happens in the future (e.g. file a support ticket or file a bug on workerd on GitHub or complain in our Discord or email kenton@cloudflare.com).
If we start using workers though I'll definitely let you know if any API changes!
As mentioned in other comments, run it on your own if you are not happy with the stability. Or just pay someone to provide it - like your ISP.
And TBH I trust my local ISP more than Google or CF. Not in availability, but it's covered by my local legislation. That's a huge difference - in a positive way.
which might not be a good thing in some jurisdictions - see the porn block in the UK (it's done via dns iirc, and trivially bypassed with a third party dns like cloudflare's).
So far I'm lucky and the only ban I'm aware of is on gambling. Which is fine for me personally.
But in the UK's case I'd use a non-local one as well.
I don't think this is fair when discussing infrastructure. It's reasonable to complain about potholes, undrinkable tap water, long lines at the DMV, cracked (or nonexistent) sidewalks, etc. The internet is infrastructure and DNS resolution is a critical part of it. That it hasn't been nationalized doesn't change the fact that it's infrastructure (and access absolutely should be free) and therefore everyone should feel free to complain about it not working correctly.
"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
DNS is infrastructure. But "Cloudflare Public Free DNS Resolver" is not, it's just a convenience and a product to collect data.
(This isn't a major concern, of course; and I mention it just to extend your argument yet further. The major gain of a private root content DNS server is the fraction of really stupid nonsense DNS traffic that comes about because of various things gets filtered out either on-machine or at least without crossing a border router. The gains are in security and privacy more than uptime.)
>"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
You don't want DNS to be nationalized. Even the US would have half the internet banned by now.
But opposite to tap water there are a lot of different free DNS resolvers that can be used.
And I don't see how my taxes funded CFs DNS service. But my ISP fee covers their DNS resolving setup. That's the reason why I wrote
> a service that's free of charge
Which CF is.
I did this for a while, but ~300ms hangs on every DNS resolution sure do get old fast.
With something like an N100- or N150-based single-board computer (perhaps around $200) running any number of open source DNS resolvers, I would expect you can average around 30 ms for cold lookups and <1 ms for cache hits.
Edit: How to serve the root zone locally with unbound. https://old.reddit.com/r/pihole/comments/s43o8j/where_does_u...
[0] dig axfr . @k.root-servers.net
[0]: https://root-servers.org/
[1]: https://github.com/jschauma/tld-zoneinfo
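Concretely, the "serve the root zone locally" approach mentioned above boils down to an auth-zone stanza in unbound.conf, along the lines of the example in unbound's documentation; a sketch (root server addresses current as of writing, see root-servers.org):

auth-zone:
    name: "."
    primary: 199.9.14.201        # b.root-servers.net ("master:" on older unbound)
    primary: 193.0.14.129        # k.root-servers.net
    fallback-enabled: yes        # fall back to normal recursion if the transfer fails
    for-downstream: no           # don't serve the zone to clients, only use it internally
    for-upstream: yes
    zonefile: "root.zone"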
Even if a root server wasn't in the US, it will still be pretty slow for me. Europe is far worse. Most of Asia has bad paths to me, except for Japan and Singapore which are marginally better than the US. Maybe Aus has one...?
Incompetent admins. dnsmasq at least has an option to override it (--min-cache-ttl=<time>)
When the DNS resolver is down, it affects everything; 100% uptime is a fair expectation, hence redundancy. Looks like both 1.0.0.1 and 1.1.1.1 were down for more than 1h, pretty bad TBH, especially when you advise global usage.
The RCA is not detailed and feels like the kind of marketing stunt we are now getting every other week.
But I do appreciate these types of detailed public incident reports and RCAs.
Very frustrating.
Secondary DNS is supposed to be in an independent network to avoid precisely this.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
What caused this specific behavior is the dilemma of backwards compatibility when it comes to BGP security. We are a long way off from all routes being covered by RPKI (just 56% of v4 routes according to https://rpki-monitor.antd.nist.gov/ROV), so invalid routes tend to be treated as less preferred, not rejected, by BGP speakers that support RPKI.
I know.