Don't you normally have 2 DNS servers listed on any device? So was the second also down? If not, why didn't it fail over to that?
Unfortunately, the configuration mistake that caused this outage disabled Cloudflare's BGP advertisements of both 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
What are you trying to do on your wifi?
If you aren't using their DNS, then your network requests just get dropped (as you're not approved yet). You need their DNS to learn how to reach their captive portal host so they can whitelist your MAC address.
- frequently captive portals only permit access for 1-2 hours. Your internet gets cut off, then you have to realize it's not a temporary issue but a portal issue, then you close the VPN, try to find the captive portal, and re-auth.
- latency is too high for my home VPN when I travel in Asia
Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server from an IP rather than from a hostname that itself has to be resolved by a DNS server :/
Normal DNS can normally be changed in your connection settings for a given connection on most flavours of Android.
Yes, sorry, I did not mention it.
So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
Not true. If the (DoH) host has multiple A/AAAA records (multiple IPs), any decent DoH client would retry its requests over multiple or all of those IPs.
DoH hosts can resolve to multiple IPs (and even different IPs for different clients)?
Also see TFA
It's worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address. DoH remained available and traffic was mostly unaffected as cloudflare-dns.com uses a different set of IP addresses.
Yes, but not from a different organization. That was GP's point with
> So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
A cross-organizational fallback is not possible with DoH in many clients, but it is with plain old DNS.
> It's worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com
Yes, but that has nothing to do with failovers to an infrastructurally/operationally separate secondary server.
That's client implementation lacking, not some issue inherent to DoH?
The DoH client is configured with a URI Template, which describes how to construct the URL to use for resolution. Configuration, discovery, and updating of the URI Template is done out of band from this protocol.
Note that configuration might be manual (such as a user typing URI Templates in a user interface for "options") or automatic (such as URI Templates being supplied in responses from DHCP or similar protocols). DoH servers MAY support more than one URI Template. This allows the different endpoints to have different properties, such as different authentication requirements or service-level guarantees.
https://datatracker.ietf.org/doc/html/rfc8484#section-3
But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.
Which means that you’d be on cloudflare half the time and on google half the time which may not be what you wanted.
I haven't been able to find any recourse. The malware was online for a few hours but it has been weeks and there seems to be no way to clear my name. Someone on github (the website is open source) suggested that it's probably because they didn't know of the website, like everyone heard of wetransfer and github and so they don't get the whole domain blocked for malicious user content. I can't find any other difference, but also no responsible party to ask. The false-positive reporting tool on quad9's website just reloads the page and doesn't do anything
¹ I'm aware DNS can't do this, but with a direct way of contacting a very responsive admin (no captchas or annoying forms, just email), I'd not expect scanners to resort to blocking the domain outright to begin with, at least not after they heard back the first time and the problematic content has been cleared swiftly
Sometimes the upstream blocklist provider will be easy to contact directly as well. Sometimes not so much.
There was no ticket number yet because I was mainly trying to resolve it upstream (whoever made it get into uBlock's default block list, Quad9, and probably other places) and then today when I checked your site specifically, the link in "False Positive? <Please contact us>" (when you do a lookup for a blocked domain) just links back to itself so I couldn't open a case there either. Now that I look at the page again, with the advice in mind from a sibling comment to just email you, I now see that maybe this is supposed to go to the generic contact form and I needn't go through this domain status page. Opening the contact page now, I see that removal from blocklist is a selectable option so I'll use that :)
The ticket number I just submitted is 41905. Not that I'd want you to now apply preferential treatment, I didn't expect my post above to be seen by many people though I very much appreciate that you've reached out here. Makes me think you're actually interested in resolving this type of issue for small website operators, where the complete block without so much as a heads up felt a bit, well, like that might not get me anywhere. If the process just works as it normally should, that's good enough for me! Thanks for encouraging me to actually open a ticket!
You just convinced me to ditch quad9.
I don't see contact info on your profile or website/blog, but I can post here what the outcome is
Edit: I love your blog's theme btw!
The templates you need are:
1.1.1.1: https://cloudflare-dns.com/dns-query
9.9.9.9: https://dns.quad9.net/dns-query
8.8.8.8: https://dns.google/dns-query
See https://learn.microsoft.com/en-us/windows-server/networking/... for info on how to set the templates.
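For illustration, on recent Windows (11 / Server 2022+) the template is set per server with PowerShell; a sketch from memory, so treat the cmdlet and parameter names as assumptions to verify against the linked doc, and "Ethernet" as a placeholder interface name:

# register the DoH template for the resolver IP (hedged sketch)
Add-DnsClientDohServerAddress -ServerAddress 1.1.1.1 -DohTemplate "https://cloudflare-dns.com/dns-query" -AllowFallbackToUdp $false -AutoUpgrade $true
# then point the interface at that IP as usual
Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ServerAddresses 1.1.1.1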
This "URL template" thing seems odd – is Windows doing something like creating a URL out of the DNS IP and a pattern, e.g. 1.1.1.1 + "https://<ip>/foo" would yield https://1.1.1.1/foo?
If so, why not just allow providing an actual URL for each server?
It doesn't say they sell traffic logs outright, but they do send telemetry on blocked domains to the blocklist provider, and provide "a sparse statistical sampling of timestamped DNS responses" to "a very few carefully vetted security researchers". That's not exactly "selling traffic logs", but it is fairly close. Moreover, colloquially speaking, it's not uncommon to claim "Google sells your data", even though they don't provide dumps and only disclose aggregated data.
The part about sharing data with "a very few carefully vetted security researchers" doesn't preclude them from leaking domains. For instance, if the security researcher exports a "SELECT hostname, COUNT(*) ... GROUP BY hostname" query, that would arguably count as "summary form", and would include any secret hostnames.
>https://quad9.net/privacy/policy/#22-data-collected
If you're trying to imply that they can't possibly be leaking hostnames because they don't collect hostnames, that's directly contradicted by the subsequent sections, which specifically mention that they share metrics grouped by hostname. Obviously they'll need to collect hostnames to provide such information.
Right, but the privacy policy also says there's a separate program for "a very few carefully vetted security researchers" where they can get data in "summary form", which can leak domain name in the manner I described in my previous comment. Maybe they have a great IRB (or similar) that would prevent this from happening, but that's not mentioned in the privacy policy. Therefore it's totally in the realm of possibility that secret domain names could be leaked, no "really having a go with the shoehorn" required.
You should probably be using a trusted TLS certificate for your git hosting. And that means the host name will end up in certificate transparency logs which are even easier to scrape than DNS queries.
> How Quad9 protects your privacy?
> When your devices use Quad9 normally, no data containing your IP address is ever logged in any Quad9 system.
Of course they have some kinds of logs. Aggregating resolved domains without logging client IPs is not what the implication of "Quad9 is reselling the traffic logs" seems to be.
I am curious though, do you have any suggestions for alternative DNS that is better?
The very idea strikes me as irresponsible and misguided.
Thus, if you don't know the host, you will not be able to hit the backend service. But if you do know it, you can start exploiting it, either through lack of auth or by trying to exploit the software itself.
If your device doesn't support proper failover, use a local DNS forwarder on your router or an external one.
In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> dns0.eu (unfiltered version)
I get that in theory blah blah, but we now have choices in who gets to see all of our requests and the ISP will always lose out to the other losers in the list
If you choose a resolver that is very far away, 100 ms longer page loads do add up quickly...
Note how root "." just works and has done for decades - that's proper engineering and actually way more complicated than running 1.1.1.1. What 1.1.1.1 suffers from is anycast and not DNS.
Cloudflare (and Google and co) insist on using one or more "vanity" IP addresses - that is very unfair of me but that is what it is, and to make it work, they have to use anycast.
The real issue is fixing anycast and not DNS.
Anyway, select two+ providers and set them.
Unless you do something fancy with a local caching dns proxy with more than one upstream.
I would count not configuring at least two as 'user error'. Many systems require you to enter a primary and alternate server in order to save a configuration.
Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will be "let's wait and see if it fixes itself".
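For reference, that "fail N times before alerting" behavior is just a couple of directives in a classic Nagios/Icinga 1.x object definition; a sketch, with made-up host/command names and assuming a check_dns command is defined:

define service {
    host_name             resolver-edge-01    ; hypothetical host
    service_description   DNS answers on port 53
    check_command         check_dns           ; assumes this command exists in commands.cfg
    check_interval        1                   ; minutes between normal checks
    retry_interval        1                   ; recheck quickly while the failure is still SOFT
    max_check_attempts    3                   ; only page after 3 consecutive failures
}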
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
Were it a much smaller service than 1.1.1.1, taking longer than a minute to alarm probably wouldn't surprise me, but this is 1.1.1.1; they're dealing with vast amounts of probably fairly consistent traffic.
I don’t want to devolve this to an argument from authority, but - there’s a lot of trade offs to monitoring systems, especially at that scale. Among other things, aggregation takes time at scale, and with enough metrics and numbers coming in, your variance is all over the place. A core fact about distributed systems at this scale is that something is always broken somewhere in the stack - the law of averages demands it, and so if you’re going to do an all-fire-alarm alert any time part of the system isn’t working, you’ve got alarms going off 24/7. Actually detecting that an actual incident is actually happening on a machine of the size and complexity we’re talking about within 5 minutes is absolutely fantastic.
Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems take a while to redeploy in this case, assuming that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you often can differentiate no data and real drops, but the overall point, of "if you page people constantly, people will quit" I think is the important one. If people keep getting paged for too tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
When you are building systems like 1.1.1.1 having an alert rollup of five minutes is not acceptable as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
Now without crying: I saw multiple big companies getting rid of the NOC and replacing it with on-call duties in multiple, focused teams. Instead of 12 people sitting 24/7 in groups of 4 and doing some basic analysis and steps before calling others - you page the correct people in 3-5 minutes, with an exact and specific alert.
Incident resolution times went down greatly (2-10x, depending on the company), people don't have to sit overnight sleeping most of the time, and no pointless actions like a service restart get taken that slow down incident resolution.
And I don't like that some platforms hire 1500 people for a job that could be done by 50-100, but in terms of incident response - if you already have teams with separated responsibilities, then a NOC is "legacy".
(Have worked as SRE at large global platform)
I just mostly over the last few years tune out such responses and try not to engage them. The whole uninformed "Well, if it were me, I would simply not do that" kind of comment style has been pervasive on this site for longer than AI though, IMO.
It took me a very long time to realize that^. I've worked with two NOCs at two huge companies, and I know they still exist as teams at those companies. I'm not an SWE, though. And I'm not certain I'd qualify either company as truly "global" except in the loosest sense - as in, one has "American" in the name of the primary subsidiary.
^ I even regularly have used "the comments were people incorrecting each other about <x>", so I knew subconsciously that HN is just a different subset of general internet comments. The issue comes from this site appearing to be moderated, and the group of people that select for commenting here seem like they would be above average at understanding and backing up claims. The "incorrecting" label comes from n-gate, which hasn't been updated since the early '20s, last I checked.
Step 1: You start out with the founders being on call 24x7x365, or people in the first 10 or 20 hires "carry the pager" on weekends and evenings, and your entire company is doing unpaid rostered on-call.
Step 2: You steal all the underwear.
Step 3: You have follow-the-sun office-hours support staff teams distributed around the globe with sufficient coverage for vacations and unexpected illness or resignations.
<google google google>
"Original air date: December 16, 1998"
Oh, right. Half of you weren't even born... Now I feel ooooooold.
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
Operating at the scale of cloudflare? A lot.
* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error
* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped
* traffic appears to be down 90% but actually the counter rolled over and prometheus handled it wrong
* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag
* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right acls around snmp so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%.... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones.
* traffic is actually down 90% because there's an internet routing issue (not the dns team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for almost a decade.
1: localized issues did exist, but that's really the fault of the internet and they did remain running when google itself suffered severe downtime in various different services.
European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.
I have musknet, though, so I can't edit the DNS providers on the router without buying another router, so cellphones aren't automatically on this plan, nor are VMs and the like.
Having a fully configured spare pi-hole in a box also helps. Another time my pi-hole refused to boot after a power outage.
So I went to Best Buy and bought 3 routers, and set each one up for 1 week. Turns out, you can get publicly routable IPv6 with a third-party router, if the router supports IPv6.
I still see people mentioning OPNsense and pfSense on here from time to time, and I wonder if I got the wrong - maybe outdated - ISO images? I also tried doing it with FreeBSD and Debian and couldn't figure it out, which is a bit depressing for me. I'll try again someday.
You ask the .com servers for domain.com's NS records, and then you ask ns1.domain.com for foo.domain.com. Then you browse to wikipedia.org, and none of those DNS queries go to the same place as the previous site's did.
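You can watch that delegation walk with dig's trace mode (illustrative; example.com here stands in for any zone):

dig +trace foo.example.com
# shows the referrals: root -> .com servers -> example.com's NS -> answer;
# a different domain's walk touches a completely different set of servers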
Cloudflare has a reasonable culture around incident response, but it doesn't incentivize proactive prevention.
From the longer term graphs it looks like volume returned to normal https://imgur.com/a/8a1H8eL
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
It’s corporate newspeak. “legacy” isn’t a clear term, it’s used to abstract and obfuscate.
> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.
I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.
I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.
I do believe they could have elaborated more on the follow-up steps they will take to prevent this from happening again. I don’t think staggered roll-outs are the only answer to this; they’re just a safety net.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Or they have a different definition of impact than I do
Note that this introduces one extra query of overhead per DNS request if the previously cached entry has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.
In theory, this should eliminate that overhead. Your operating system can validate the IP address of the DoH server by using the Subject Alternative Name (SAN) field within the TLS certificate it presents: https://g.co/gemini/share/40af4514cb6e
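You can see those IP SANs for yourself; a quick sketch with the openssl CLI (exact output formatting varies by version):

openssl s_client -connect 1.1.1.1:443 -servername cloudflare-dns.com </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# expected to list DNS:cloudflare-dns.com alongside IP Address entries such as 1.1.1.1 and 1.0.0.1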
Let's Encrypt is trialling IP address HTTPS/TLS certificates right now:
https://letsencrypt.org/2025/07/01/issuing-our-first-ip-addr...
They say:
"In principle, there’s no reason that a certificate couldn’t be issued for an IP address rather than a domain name, and in fact the technical and policy standards for certificates have always allowed this, with a handful of certificate authorities offering this service on a small scale."
DigiCert does. That is where 1.1.1.1 and 9.9.9.9 get their valid certificates from
So certs were often tied to identity, which an IP really isn't, so few providers offered them.
There are two main reasons IP certificates were not widely used in the past:
- Before the SAN extension, there was just the CN, and there's only one CN per certificate. It would generally be a waste to set your only CN to a single IP address (or spend more money on more certs and the infrastructure to maintain them). A domain can resolve to multiple IPs, which can also be changed over time; users usually want to go to e.g. microsoft.com, not whatever IP that currently resolves to. We've had SANs for awhile now, so this limitation is gone.
- Domain validation (serve this random DNS record) involves ordinary forward-lookup records under your domain. Trying to validate IP addresses over DNS would involve adding records to the reverse-lookup in-addr.arpa domain, which varies in difficulty from annoying (you work for a large org that owns its own /8, /16, or /24) to impossible (you lease a small number of unrelated IPs from a bottom-dollar ISP). IP addresses are much more doable now thanks to HTTP validation (serve this random page on port 80), but that was an unnecessary/unsupported modality before.
> Your operating system can validate the IP address of the DoH server by using the Subject Alternative Name (SAN) field within the TLS certificate it presents: https://g.co/gemini/share/40af4514cb6e
How is the IP address of the DoH server obtained?
> network.trr.bootstrapAddress
> (default: none) by setting this field to the IP address of the host name used in "network.trr.uri", you can bypass using the system native resolver for it. Use this to get the IPs of the cloudflare server: https://dns.google/query?name=mozilla.cloudflare-dns.com
> Starting with Firefox 74 setting the bootstrap address is no longer required in mode 3. Firefox will attempt to use regular DNS in order to get the IP address of the trusted resolver. However, if DNS resolution of the resolver domain fails, setting the bootstrap address is again necessary.
TL;DR: DoH was working
all-servers
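# all-servers: query every listed upstream in parallel and return the first reply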
server=8.8.8.8
server=9.9.9.9
server=1.1.1.1
If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.
Using the servers in the above example, and assuming IPv4 + IPv6:
1.1.1.1
2001:4860:4860::8888
9.9.9.9
2606:4700:4700::1111
8.8.8.8
2620:fe::fe
1.0.0.1
2001:4860:4860::8844
149.112.112.112
2606:4700:4700::1001
8.8.4.4
2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all Cloudflare IPs together, then all Google IPs, etc.
Also note that Quad9 filters by default on this IP while the other two do not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-DNSSEC-validating resolvers if you care about that (all of the above are DNSSEC validating).
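A minimal sketch of that interleaving in /etc/systemd/resolved.conf (IPv4 only here; the IPv6 addresses above can be mixed in the same way), followed by a restart of systemd-resolved:

[Resolve]
# consecutive entries belong to different operators, so the first
# fallback attempt already leaves the affected provider
DNS=1.1.1.1 8.8.8.8 9.9.9.9 1.0.0.1 8.8.4.4 149.112.112.112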
I was handling an incident due to this outage. I ended up adding Google DNS resolvers using systemd-resolved, but I didn't think to interleave them!
dnsmasq with a list of smaller trusted DNS providers sounds perfect, as long as it is not considered bad etiquette to spam multiple DNS providers for every resolution?
But where to find a trusted list of privacy focused DNS resolvers. The couple I tried from random internet advice seemed unstable.
If I have issues with cloudflare what do I do?
I believe that they follow their published policies and have reasonable security teams. They're also both popular services, which mitigates many of the other types of DNS tracking possible.
https://developers.google.com/speed/public-dns/privacy https://developers.cloudflare.com/1.1.1.1/privacy/public-dns...
> OpenNIC (also referred to as the OpenNIC Project) is a user owned and controlled top-level Network Information Center offering a non-national alternative to traditional Top-Level Domain (TLD) registries; such as ICANN.
I need to do a write-up one day
server:
logfile: ""
log-queries: no
# adjust as necessary
interface: 127.0.0.1@53
access-control: 127.0.0.0/8 allow
infra-keep-probing: yes
tls-system-cert: yes
forward-zone:
name: "."
forward-tls-upstream: yes
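# addr@853#name means DNS over TLS to addr, authenticating the certificate against name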
forward-addr: 9.9.9.9@853#dns.quad9.net
forward-addr: 193.110.81.9@853#zero.dns0.eu
forward-addr: 149.112.112.112@853#dns.quad9.net
forward-addr: 185.253.5.9@853#zero.dns0.eu
If you want to eschew centralized DNS altogether, if you run a Tor daemon, it has an option to expose a DNS resolver to your network. Multiple resolvers if you want them.
I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
[0] https://man7.org/linux/man-pages/man3/inet_aton.3.html#DESCR...
1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere. Indeed right now 1.1.1.1 from my laptop goes via 141.101.71.63 and 1.0.0.1 via 141.101.71.121, which are both hosts on the same LINX/LON1 peer but presumably from different routers, so there is some resilience there.
Given DNS is about the easiest thing to avoid a single point of failure on I'm not sure why you would put all your eggs in a single company, but that seems to be the modern internet - centralisation over resilience because resilience is somehow deemed to be hard.
I guess. I wouldn't have thought it worthwhile for 4 chars, but yes.
> 1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere.
I thought anycast gave them that on a single IP, though perhaps this is even more resilient?
You can see they are separate routes, say by looking up each IP in Telia's looking glass:
https://lg.telia.net/?type=bgp&router=fre-peer1.se&address=1...
https://lg.telia.net/?type=bgp&router=fre-peer1.se&address=1...
In this case they both are advertised from the same peer above, I suspect they usually are - they certainly come from the same AS, but they don't need to. You could have two peers with cloudflare with different weights for each /24
That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.
And the closest resolving proxy DNS server for most of my machines is listening on their loopback interface. The closest such machine happens to be about 1m away, so is beaten out of first place by centimetres. (-:
It's a shame that Microsoft arbitrarily ties such functionality to the Server flavour of Windows, and does not supply it on the Workstation flavour, but other operating systems are not so artificially limited or helpless; and even novice users on such systems can get a working proxy DNS server out of the box that their sysops don't actually have to touch.
The idea that one has to rely upon an ISP, or even upon CloudFlare and Google and Quad9, for this stuff is a bit of a marketing tale that is put about by these self-same ISPs and CloudFlare and Google and Quad9. Not relying upon them is not actually limited to people who are skilled in system operation, i.e. who they are; but rather merely limited by what people run: black box "smart" tellies and whatnot, and the Workstation flavour of Microsoft Windows. Even for such machines, there's the option of a decent quality router/gateway or simply a small box providing proxy DNS on the LAN.
In my case, said small box is roughly the size of my hand and is smaller than my mass-market SOHO router/gateway. (-:
Changed back to just using big resolvers and all those issues disappeared.
If you run your own recursive DNS server (I keep forgetting to use the right term) on a local network, you can hit the root servers directly, which makes that the most reliable possible DNS resolver. Yes you might get more cache misses initially but I highly doubt you'd notice. (note: querying the root nameservers is bad netiquette; you should always cache queries to them for at least 5 minutes, and always use DNS resolvers to cache locally)
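If you want to try it, a purely recursive unbound instance needs very little configuration; a minimal sketch (unbound ships with built-in root hints, so no forward-zone is required):

server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    # with no forward-zone configured, unbound walks delegations from the
    # root servers itself and caches every answer locally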
I'd argue that accounting for poorly managed ISP resolvers is a critical part of reasoning about reliability.
In terms of my everyday usage, for the past couple of decades, cache miss delays are largely lost in the noise of stupidly huge WWW pages, artificial service greylisting delays, CAPTCHA delays, and so forth.
Especially as the first step in any full cache miss, a back-end query to the root content DNS server, is also just a round-trip over the loopback interface. Indeed, as is also the second step sometimes now, since some TLDs also let one mirror their data. Thank you, Estonia. https://news.ycombinator.com/item?id=44318136
And the gains in other areas are significant. Remember that privacy and security are also things that people want.
Then there's the fact that things like Quad9's/Google's/CloudFlare's anycasting surprisingly often results in hitting multiple independent servers for successive lookups, not yielding the cache gains that a superficial understanding would lead one to expect.
Just for fun, I did Bender's test at https://news.ycombinator.com/item?id=44534938 a couple of days ago, in a loop. I received reset-to-maximum TTLs from multiple successive cache misses, on queries spaced merely 10 seconds apart, from all three of Quad9, Google Public DNS, and CloudFlare 1.1.1.1. With some maths, I could probably make a good estimate as to how many separate anycast caches on those services are answering me from scratch, and not actually providing the cache hits that one would naïvely think would happen.
I added 127.0.0.1 to Bender's list, of course. That had 1 cache miss at the beginning and then hit the cache every single time, just counting down the TTL by 10 seconds each iteration of the loop; although it did decide that 42 days was unreasonably long, and reduced it to a week. (-:
</soapbox>
Judging by Cloudflare's privacy policy, they hold less personally identifiable information than my ISP while offering EDNS and low latencies? Win, win, win.
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.
https://developers.cloudflare.com/1.1.1.1/faq/#does-1111-sen...
I've also changed to 9.9.9.9 and 8.8.8.8 after using 1.1.1.1 for several years because connectivity here is not very good, and being connected to the wrong data center means RTT in excess of 300 ms. Makes the web very sluggish.
Quad9 has a very aggressive blocking policy (my site with user-uploaded content was banned without even reporting the malicious content; if you're a big brand name it seems to be fine to have user-uploaded content though) which this would be a possible workaround for, but it may not take an nxdomain response as a resolver failure
Although, perhaps, having an external VPS with a dns proxy could be a good middle ground?
And it’s not a conspiracy theory - it was very suspicious when we did some testing with a small, aware group. The traffic didn’t look like it was being handled anonymously on Google’s side.
Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.
It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10% refund for the month if you paid for it.
I find it somewhat surprising that none of the multiple engineers who reviewed the original change in June noticed that they had added 1.1.1.0/24 to the list of prefixes that should be rerouted. I wonder what sort of human mistake or malice led to that original error.
Perhaps it would be wise to add some hard-coded special-case mitigations to DLS such that it would not allow 1.1.1.1/32 or 1.0.0.1/32 to be reassigned to a single location.
But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use the same ops/deploy/etc stack, in this one, you probably want an extra couple steps in the way of potentially taking a large public service offline.
Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC
Weird. According to my own telemetry from multiple networks they were unavailable for a lot longer than that.
EDIT: Appears I was wrong, it is failover not round-robin between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
If you have a more advanced local resolver of some sort (systemd for example) you can configure whatever behaviour you want.
This writing is just brilliant. Clear to technical and non-technical readers. Makes the in-progress migration sound way more exciting than it probably is!
> We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.
This is about as good as you can get it from a company as serious and important as Cloudflare. Bravo to the writers and vetters for not watering this down.
Maybe there is a noticeable difference?
I have seen more outage incident reports from Cloudflare than from Google, but this is just a personal anecdote.
Over the last 30 days, 8.8.8.8 has had 99.99% uptime vs 99.09% for 1.1.1.1.
For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are 15.0ms, and 9.9.9.9 is 13.8ms.
All of those servers return over 3-nines of uptime when quantised in the "worst result in a given 1 minute bucket" from my monitoring points, which seem fine to have in your mix of upstream providers. Personally I'd never rely on a single provider. Google gets 4 nines, but that's only over 90 days so I wouldn't draw any long term conclusions.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
I use their DNS over HTTPS and if I hadn't seen the issue being reported here, I wouldn't have caught it at all. However, this—along with a chain of past incidents (including a recent cascading service failure caused by a third-party outage)—led me to reduce my dependencies. I no longer use Cloudflare Tunnels or Cloudflare Access, replacing them with WireGuard and mTLS certificates. I still use their compute and storage, but for personal projects only.
The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Yes.
> Shouldn’t the fix be to ensure that these are served out of completely independent silos [...]?
Yes.
> If so, why were they on the same infrastructure?
Apparently, they weren’t independent enough: a single configuration change inside Cloudflare affected the BGP advertisements for both address ranges at once.
The solution for the end user is, of course, to use 1.1.1.1 and 8.8.8.8 (or any other combination of two different resolvers).
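On a plain libc stub resolver that is just two lines in /etc/resolv.conf; a sketch (the timeout tweak is optional, glibc defaults to a 5 s timeout per server):

# two resolvers run by different operators
nameserver 1.1.1.1
nameserver 8.8.8.8
options timeout:2 attempts:2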
I use Cloudflare at work. Cloudflare has many bugs, and some technical decisions are absurd, such as the Workers cache.delete method, which only clears the cache contents in the data center where the Worker was invoked!!! https://developers.cloudflare.com/workers/runtime-apis/cache...
In my experience, Cloudflare support is not helpful at all, trying to pass the problem onto the user, like "Just avoid holding it in that way. ".
At work, I needed to use Cloudflare. The next job I get, I'll put a limit on my responsibilities: I don't work with Cloudflare.
I will never use Cloudflare at home and I don't recommend it to anyone.
Next week: A new post about how Cloudflare saved the web from a massive DDOS attack.
The Cache API is a standard taken from browsers. In the browser, cache.delete obviously only deletes that browser's cache, not all other browsers in the world. You could certainly argue that a global purge would be more useful in Workers, but it would be inconsistent with the standard API behavior, and also would be extraordinarily expensive. Code designed to use the standard cache API would end up being much more expensive than expected.
With all that said, we (Workers team) do generally feel in retrospect that the Cache API was not a good fit for our platform. We really wanted to follow standards, but this standard in this case is too specific to browsers and as a result does not work well for typical use cases in Cloudflare Workers. We'd like to replace it with something better.
To me, it only makes sense if the put method creates a cache only in the datacenter where the Worker was invoked. Put and delete need to be related, in my opinion.
Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
My criticisms aren't about functionality per se or the developers. I don't doubt the developers' competence, but I feel like there's something wrong with the company culture.
That is, in fact, how it works. cache.put() only writes to the local datacenter's cache. If delete() were global, it would be inconsistent with put().
> Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
Say you read the cache entry but you find, based on its content, that it is no longer valid. You would then want to delete it, to save the cost of reading it again later.
Thanks, I didn't know that (I don't remember reading it in the documentation)
That said, I don't use workers and don't plan to. I personally try to stay away from non cross-platform stuff because I've been burned too heavily with vendor/platform lock-in in the past.
If we changed an API in Workers in a way that broke any Worker in production, we consider that an incident and we will roll it back ASAP. We really try to avoid this but sometimes it's hard for us to tell. Please feel free to contact us if this happens in the future (e.g. file a support ticket or file a bug on workerd on GitHub or complain in our Discord or email kenton@cloudflare.com).
If we start using workers though I'll definitely let you know if any API changes!
As mentioned in other comments, run it on your own if you are not happy with the stability. Or just pay someone to provide it - like your ISP.
And TBH I trust my local ISP more than Google or CF. Not in availability, but it's covered by my local legislation. That's a huge difference - in a positive way.
which might not be a good thing in some jurisdictions - see the porn block in the UK (it's done via dns iirc, and trivially bypassed with a third party dns like cloudflare's).
So far I'm lucky and the only ban I'm aware of is on gambling. Which is fine for me personally.
But in the UK's case I'd use a non-local one as well.
I don't think this is fair when discussing infrastructure. It's reasonable to complain about potholes, undrinkable tap water, long lines at the DMV, cracked (or nonexistent) sidewalks, etc. The internet is infrastructure and DNS resolution is a critical part of it. That it hasn't been nationalized doesn't change the fact that it's infrastructure (and access absolutely should be free) and therefore everyone should feel free to complain about it not working correctly.
"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
DNS is infrastructure. But "Cloudflare Public Free DNS Resolver" is not, it's just a convenience and a product to collect data.
(This isn't a major concern, of course; and I mention it just to extend your argument yet further. The major gain of a private root content DNS server is the fraction of really stupid nonsense DNS traffic that comes about because of various things gets filtered out either on-machine or at least without crossing a border router. The gains are in security and privacy more than uptime.)
>"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
You don't want DNS to be nationalized. Even the US would have half the internet banned by now.
But opposite to tap water there are a lot of different free DNS resolvers that can be used.
And I don't see how my taxes funded CFs DNS service. But my ISP fee covers their DNS resolving setup. That's the reason why I wrote
> a service that's free of charge
Which CF is.
I did this for a while, but ~300ms hangs on every DNS resolution sure do get old fast.
With something like an N100- or N150-based single-board computer (perhaps around $200) running any number of open source DNS resolvers, I would expect you can average around 30 ms for cold lookups and <1 ms for cache hits.
Edit: How to serve the root zone locally with unbound. https://old.reddit.com/r/pihole/comments/s43o8j/where_does_u...
[0] dig axfr . @k.root-servers.net
[0]: https://root-servers.org/
[1]: https://github.com/jschauma/tld-zoneinfo
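Concretely, the "serve the root zone locally" approach mentioned above boils down to an auth-zone stanza in unbound.conf, along the lines of the example in unbound's documentation; a sketch (root server addresses current as of writing, see root-servers.org):

auth-zone:
    name: "."
    primary: 199.9.14.201        # b.root-servers.net ("master:" on older unbound)
    primary: 193.0.14.129        # k.root-servers.net
    fallback-enabled: yes        # fall back to normal recursion if the transfer fails
    for-downstream: no           # don't serve the zone to clients, only use it internally
    for-upstream: yes
    zonefile: "root.zone"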
Even if a root server wasn't in the US, it will still be pretty slow for me. Europe is far worse. Most of Asia has bad paths to me, except for Japan and Singapore which are marginally better than the US. Maybe Aus has one...?
Incompetent admins. dnsmasq at least has an option to override it (--min-cache-ttl=<time>)
When the DNS resolver is down, it affects everything; 100% uptime is a fair expectation, hence redundancy. Looks like both 1.0.0.1 and 1.1.1.1 were down for more than 1h, pretty bad TBH, especially when you advise global usage.
The RCA is not detailed and feels like the kind of marketing stunt we are now getting every other week.
But I do appreciate these types of detailed public incident reports and RCAs.
Very frustrating.
Secondary DNS is supposed to be in an independent network to avoid precisely this.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
What caused this specific behavior is the dilemma of backwards compatibility when it comes to BGP security. We are a long way off from all routes being covered by RPKI (just 56% of v4 routes according to https://rpki-monitor.antd.nist.gov/ROV), so invalid routes tend to be treated as less preferred, not rejected, by BGP speakers that support RPKI.
I know.