Route leak incident on January 22, 2026 (blog.cloudflare.com)

165 points by nomaxx117 15 days ago | 16 comments

arjie 15 days ago
Based on the number of times I've seen these posted about they seem quite frequent[0]. If I'm being honest, the entire BGP system seems to be very fragile with a massive blast radius. I get that it's super 'core' so it's hard to fix, and that it comes from a time when the Internet was more 'cooperative' (in the protocol sense of the word) but are there any attempts at a successor or is it impossible to do so fundamentally?
Surely the notion of who owns an AS should be cryptographically held so that an update has to be signed. Updates should be infrequent so the cost is felt on the control plane, not on the data plane.
I'm sure there's a BGPSec or whatever like all the other ${oldTech}Sec but I don't know if there is a realistic solution here or if it's IPv6 style tech.
0: I looked it up before posting and it's 3000 leakers with 12 million leaks per quarter https://blog.qrator.net/en/q3-2022-ddos-attacks-and-bgp-inci...
- direwolf20 15 days ago
  Globally, it is as you want it to be.
  Locally, BGP is peer-to-peer — literally! — and no particular peer is forced to check everything, and nobody's even trying to make a single global routing table so local agreements can override anything at a higher level.
  - arjie 15 days ago
    I see. That makes sense.
    direwolf20 15 days ago
    A route leak is often like this: an ISP in Pakistan is ordered to censor YouTube, so they add a route internally to YouTube's IP addresses that passes to their censoring machine, or to nowhere. They accidentally have their edge routers configured to pass this route to all their connected networks instead of keeping it internally to themselves. Some of their peers recognize this as the shortest route to YouTube and install it into their own networks. Others recognize it's not the real YouTube and ignore it. Transit providers check route authorization more thoroughly than peers, so none of them accept it and the route doesn't spread globally.
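A minimal sketch of the mechanics described above, assuming AS-path length is the only tie-breaker (real best-path selection looks at local preference and other attributes first). The `Route` class, the helper names, and the ASNs and prefix (taken from documentation ranges) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple  # ASNs from the announcing neighbor back to the origin


def best_route(candidates):
    """Prefer the shortest AS path (simplified: ignores local-pref, MED, etc.)."""
    return min(candidates, key=lambda r: len(r.as_path))


def accept(route, authorized_origins):
    """Keep a route only if its origin AS may announce the prefix."""
    return route.as_path[-1] in authorized_origins.get(route.prefix, set())


# Hypothetical values from documentation ranges (ASNs 64496-64511, TEST-NET-3).
legit = Route("203.0.113.0/24", (64501, 64502, 64500))
leaked = Route("203.0.113.0/24", (64510,))
authorized = {"203.0.113.0/24": {64500}}

naive_choice = best_route([legit, leaked])
filtered_choice = best_route([r for r in (legit, leaked) if accept(r, authorized)])
```

A neighbor that skips the `accept` check ends up with `naive_choice`, the leaked route. One that filters first keeps the legitimate path.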
    j16sdiz 15 days ago
    Sometimes it is just innocent:
    An ISP leases a new 10Gb fiber to YouTube for its own customers, the route is leaked to a peer, and now every ISP in the whole country is using that fiber for YouTube.
- patmorgan23 15 days ago
  There are several enhancements that have been strapped on to BGP over the years. The article talks about two at the end that will help reduce route leaks.
  A wholesale protocol replacement is unlikely, but definitely more doable than replacing something like IP.
- _bernd 14 days ago
  Here you go: https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastruc...
  - eqvinox 14 days ago
    That, and ASPA, and https://manrs.org/
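For readers who have not seen RPKI origin validation, here is a rough sketch of the valid / invalid / not-found decision it makes for each announcement. The `ROA` tuple and `validate` function are hypothetical names, and the prefixes and ASNs come from documentation ranges. A real validator works from signed objects fetched from the RPKI repositories, not a hard-coded list.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str      # covering prefix the holder signed for
    max_length: int  # longest prefix length the origin may announce
    asn: int         # AS authorized to originate it


def validate(prefix, origin_asn, roas):
    """Return 'valid', 'invalid', or 'not-found' for an announcement."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # subnet_of() raises TypeError across IP versions, so check first.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# Hypothetical ROA using a documentation prefix and ASN.
roas = [ROA("2001:db8::/32", 48, 64500)]
results = [
    validate("2001:db8:1::/48", 64500, roas),    # right origin, allowed length
    validate("2001:db8:1::/48", 64510, roas),    # wrong origin
    validate("2001:db8:1:2::/64", 64500, roas),  # more specific than max_length
    validate("203.0.113.0/24", 64500, roas),     # no covering ROA
]
```

Origin validation only checks who originates a prefix. A leak that re-advertises someone else's route keeps the original origin, so it can still pass this check, which is the gap ASPA is meant to address.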
- arianvanp 14 days ago
  Check out https://www.scion.org/
jacquesm 15 days ago
That's like what, one major incident per month now, Nov 18, Dec 5, and now this one?
I'll bet JGC can write his own ticket by now, but unretiring would be really bad optics. He's on the board though and still keeping a watchful eye. But a couple more of these and CF's reputation will be in the gutter.
- wiether 15 days ago
  My understanding of Cloudflare's history is that they built their reputation and their client base on some high quality products.
  And instead of focusing on maintaining those, they decided to go for more money, first adding new features to their products (at the risk of breaking them) and then adding new products altogether in a move to start being an actual cloud provider.
  Priorities shifted from the quality products to pushing features daily, and the people who built and maintained the good products probably left or have been assigned to shinier products, leaving the base to decay.
  As a daily user, it's quite frustrating to have a console that is getting far worse than AWS/Azure, and features that are more a POC than actual production-ready features.
- jgrahamc 15 days ago
  https://blog.cloudflare.com/fail-small-resilience-plan/
  - jacquesm 14 days ago
    Hm. Mixed feelings, I would like a more rigorous approach to this a lot better. CF is really too big to fail now, I've had absolutely no qualms about recommending CF but after the last couple of months I'm revising that until things are measurably better.
    Your legacy is one of showing how to apply good engineering principles to complex problems at scale and I think CF is risking that reputation right now.
- stingraycharles 15 days ago
  That’s what I also thought when I saw this incident. I wonder if there’s something up internally at Cloudflare or that it was always like this.
  I feel like something such as a route leak should not be something that happens to Cloudflare. I’m surprised they set their systems up to allow this human error.
  - jacquesm 15 days ago
    John left in April last year I think so it probably isn't directly related, so please take my comment in jest, but still it is worrisome, CF is in many ways 'too big to fail' and if this really becomes a regular thing it is going to cause a lot of people focused on their 'nines' to be pissed off.
    One thing to their credit though: BGP is full of complexity and it definitely isn't the first time that something like this goes wrong, it is just that at CF scale the impact is massive so there is no room for fuckups. But doing this sort of thing right 100% of the time is a really hard problem, and I'm happy I'm not in any way responsible for systems this important.
    Whoever is responsible learned a lot of valuable lessons today (you hope).
    rkagerer 15 days ago
    The older I get, the less I buy into "too big to fail" arguments. I now view it as "can't fail soon enough". The sooner it breaks down, the sooner something better will supplant it.
    This last sentiment holds true generally since organizations no longer subject to meaningful competition inevitably squat on their laurels and stop excelling at the things they used to be good at. We've seen it everywhere - Boeing, Google, Microsoft (with OS's), etc.
    roenxi 15 days ago
    There was never much of an argument behind "too big to fail", it is generally a euphemism for upper-class welfare. In a more realist world, "too big to fail" is a mis-statement of "too risky to keep". Everything fails eventually and keeping incentives aligned relies on having a mechanism - failure - to flush out incompetents.
    mschuster91 15 days ago
    > The sooner it breaks down, the sooner something better will supplant it.
    That's not always possible, because the counterparty - aka threat actors - is always growing bigger, and you practically need to be the size of Cloudflare, Akamai or the Big 3 cloud providers to be able to weather attacks. You need to have big enough pipes to data centers and exchange points worldwide, otherwise any sufficiently motivated attacker can just go and swamp them, but big pipes are helluvalot expensive so you need to have enough large and financially capable customers.
    That's also why Cloudflare has expanded their offerings so much (e.g. Zero Trust), they need to have their infrastructure at some base load to economically justify it.
    And that's also why Cloudflare will not be kicked off the throne any time soon. First of all, the initial costs to set up a competitor are absurdly high, second, how is a competitor supposed to lure large long term customers away from CF?
    In any case, the real "fix" to Cloudflare being too-big-to-fail isn't building up competitors, it's getting the bad actors off of the Internet. Obviously that means holding both enemy (NK, Russia, China) and frenemy (India, Turkey) nations accountable, but it also means cleaning up shop at home - the aforementioned nation states and their botnet operators rely on an armada of hacked servers, ordinary computers and IoT devices in Western countries to carry out the actual work. And we clearly don't do anywhere near enough to get rid of these. I 'member a time when an abuse@ mail report would be taken seriously and the offender disconnected by their ISP. These days, no one gives a fuck.
    bflesch 15 days ago
    "Threat actor" is a relative definition, because for Italy the Cloudflare CEO was a "threat actor" who openly threatened availability of their systems.
    Cloudflare knows they are just a glorified firewall + CDN; that's why they desperately push into edge computing and keep adding these dozens of features.
  - re-thc 15 days ago
    > or that it was always like this
    The focus has been on new features and moving fast for quite some years vs reliability.
- vpShane 15 days ago
  They made themselves 'Guardians of The Internet' then gave up. If they cared, these things wouldn't happen. How many more outages, accidents, and incidents that affect millions of customers, and millions of customers of other services, are needed before they 'care'?
  They don't, because at the end of the day it's not their problem, the money rolls in regardless.
  It's sad, but it's how it is. If they cared, these things wouldn't happen. They have a lot of responsibility, but show none whatsoever.
colinbartlett 15 days ago
I do appreciate these post mortems from Cloudflare, however I wish they would include timestamps of their status page posts in their timelines.
In this case, the timeline states "IMPACT STOP" was at 20:50 UTC and the first post to their status page was 12 minutes later at 21:02 UTC:
"Cloudflare experienced a Network Route leak, impacting performance for some networks beginning 20:25 UTC. We are working to mitigate impact."
0xy 15 days ago
The string of recent incidents doesn't really make the new CTO look good. Too much focus on shipping, not enough on shipping correctly.
- iLoveOncall 15 days ago
  Welcome to the age of AI-assisted coding.
  - SketchySeaBeast 15 days ago
    I could have sworn "move fast and break things" existed before AI.
    Atreiden 15 days ago
    It did, but AI redefined the term "fast"
btown 15 days ago
> we pushed a change via our policy automation platform to remove the BGP announcements from Miami
Is there any way to test these changes against a simulation of real world routes? Including to ensure that traffic that shouldn’t hit Cloudflare servers, continues to resolve routes that don’t hit Cloudflare?
I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?
If there’s a meta-rule I think of when these incidents occur, it’s that configuration rules need change management, and change management is only as good as the level of automated testing. Just because code hasn’t changed doesn’t mean you shouldn’t test the baseline system behavior. And here, that means testing that the Internet works.
- PunchyHamster 15 days ago
  > Is there any way to test these changes against a simulation of real world routes? Including to ensure that traffic that shouldn’t hit Cloudflare servers, continues to resolve routes that don’t hit Cloudflare?
  You can get access to a view of routes from different parts of the network, but you do not have access to those routers' policies, so no.
  > I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?
  Just simulating your peers and maybe the layer after them is most likely good enough. And you can probably do it with a bunch of cgroups and some actual routing software. There are also network sims like GNS3 that can even just run router images.
- hnuser123456 15 days ago
  You can cross-reference RADB, the RIRs, and looking glass servers, and you'd find 3 different pictures of the internet.
- Analemma_ 15 days ago
  I assume it's not possible unless you know the in-memory state of all the other gateway routers on the internet, no? You can know what they advertise, but that's not the same thing as a full description of their internal state and how they will choose to update if a route gets withdrawn.
  - erredois 15 days ago
    I think you could know the state of the peers and simulate what they advertise and receive and validate that. The test unit would need to be a simulated router that behaves exactly as the real one; I actually think it's technically doable with tight version control for routers.
- toast0 15 days ago
  I don't know why you would need a tensor whatever. Dump the state of the router (which peers are connected and for how long, what routes they are advertising and for how long) as well as the computed routing table and what routes are advertised to peers.
  Set a simulation router to have the same state but a new config, and compute the routing table and what routes would be advertised to peers.
  Confirm the diff in routing table and advertised routes is reasonable.
  This change seemed to mostly be about a single location. Other BGP config changes leading to problems are often global changes, but you can check diffs and apply the config change one host at a time. You can't really make a simultaneous change anyway. Maybe one host changing is ok, but the Nth one causes a problem... CF has a lot of BGP routers, so maybe checking every diff is too much, but at least check a few.
  Is that something out of the box on routers? I don't know, people with BGP routers never let me play with them. But given the BGP haiku, I'd want something like that before I messed around with things. For the price you pay for these fancy routers, you should be able to buy an extra few to run sandboxed config testing on. You could also simulate with open source bgp software, but the proprietary BGP daemon on the router might not act like the open source one does.
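The dump, recompute, and diff steps above can be sketched in a few lines. This is only an illustration under strong assumptions: the route snapshot is a plain list of dicts, the export policies are ordinary Python predicates, and every name and prefix here is hypothetical. A real check would have to replay the vendor's own policy engine.

```python
def advertised(routes, policy):
    """Prefixes the given export policy would announce from this snapshot."""
    return {r["prefix"] for r in routes if policy(r)}


def diff_advertisements(routes, old_policy, new_policy):
    before = advertised(routes, old_policy)
    after = advertised(routes, new_policy)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Hypothetical snapshot of one router's table (documentation prefixes).
SITE_PREFIXES = {"2001:db8:b0::/48"}
snapshot = [
    {"prefix": "2001:db8:b0::/48", "internal": True},
    {"prefix": "2001:db8:aa::/48", "internal": True},
    {"prefix": "2001:db8:cc::/48", "internal": False},
]


def old_policy(r):
    return r["internal"] and r["prefix"] in SITE_PREFIXES


def new_policy(r):
    return r["internal"]  # the site condition was dropped


change = diff_advertisements(snapshot, old_policy, new_policy)
# A change meant only to withdraw routes should never add announcements.
safe_to_apply = not change["added"]
```

Here the diff shows a new announcement for a prefix that was never exported before, so a gate like `safe_to_apply` would stop the rollout on the first host.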
dfajgljsldkjag 15 days ago
We already have the tools to stop this from happening today. The problem is not the technology but the fact that companies do not want to work together to fix it. It is sad that we let the internet break because people are too slow to use the safety features we have.
- tjwebbnorfolk 15 days ago
  If a bunch of big tech companies started collaborating/colluding to implement this, we'd just have a bunch of people on HN decrying the "centralization" of the internet concentrated in the hands of a few.
  This is decentralization in action. You have to take the good with the bad.
  - redeeman 15 days ago
    there are definitely ways to make the bad much less bad
    tjwebbnorfolk 14 days ago
    If "bad" is that things are only working 99.999% of the time, and every couple years someone borks their BGP config and fixes it in a few minutes, then bad sounds pretty good to me.
    In large complex systems, perfection isn't really possible.
PunchyHamster 15 days ago
Damn, I missed the fact Juniper was acquired by HPE, RIP
- eqvinox 14 days ago
  https://chaos.social/@equinox/111752488503367272
  (disclaimer: shitpost. my shitpost.)
vlovich123 15 days ago
I’m a huge fan of flapping when it’s really hard to do progressive rollouts. What this would mean here is you switch advertising the old and new routes back and forth automatically and this happens let’s say for 1 minute max before the old config is restored. Then a human looks at various metrics before they push a button to really make the new config permanent. It gives you a cheap way to preflight what will happen when you make a globally impacting config change.
- arter45 15 days ago
  I’m not sure this would be a good idea in this kind of change.
  Flapping is bad in the networking world.
  Flapping BGP routes, specifically, is bad because it can stress all BGP routers involved to the point where they can “go crazy”. Routes are explicitly advertised, so if you keep changing the routes, you are tasking the router CPU to process new stuff, discard it and process new stuff. In fact, BGP route flaps are specifically the focus of an entire RFC: https://datatracker.ietf.org/doc/html/rfc2439
  More in general, a flapping link (on/off/on/off) can really mess with TCP.
  Flapping in the networking world is not something you want to do intentionally.
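To make the cost of flapping concrete, here is a rough sketch of the kind of damping that RFC describes: each flap adds a penalty, the penalty decays exponentially, and the route is suppressed above one threshold until it decays below another. The class name and all numeric defaults are illustrative assumptions, not values taken from the RFC or from any vendor.

```python
class FlapDamper:
    """Per-route penalty with exponential decay (illustrative defaults)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Halve the penalty for every half-life that has elapsed.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed


# Three flaps one minute apart (times in seconds).
damper = FlapDamper()
for t in (0, 60, 120):
    damper.flap(t)
suppressed_after_three = damper.is_suppressed(120)
```

With these assumed numbers, a route that flaps three times in two minutes is suppressed and stays suppressed for roughly half an hour after the last flap, even though the announcement itself is back to normal.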
  - vlovich123 15 days ago
    Ok. You can flap it slower and less frequently. The RFC you mentioned talks about timers on the order of a minute or so. So I would say advertise the new route for 1 minute and unconditionally restore for 10. Then only after that advertise for 2 minutes and restore for 10. There’s clearly some interval of on/off that isn’t a problem and that’s an effective way to evaluate the impact of a deployed route change gradually over time rather than fucking up the internet for 25 minutes until someone figures out what’s going on.
    And obviously you don’t do this on every individual route change - you batch them so it’s a release train.
    If you think there’s better techniques other than “don’t break things” I’m all for it.
    arter45 15 days ago
    This specific outage is the equivalent of this scenario:
    You have an if/then statement with N conditions in AND. You remove one condition, leaving the rest of the statement unchanged. What happens?
    The answer is that if you remove one condition, your input (in this case routes) is more likely to match N-1 conditions than N, so more input is going to be processed according to the “then” clause.
    The impact of course depends on the fact that these were BGP routes, advertised to the Internet,… but the problem itself is generic.
    What can you do?
    1) check this kind of if/then statements with special care, in order to analyze under which condition the input is processed by the “then” clause. This is exactly one of their followups [1]
    2) consider adding “global”, catch-all policies acting as an additional safety net (if applicable)
    3) test your changes not just syntactically. Set up a test environment with multiple routers, apply the configuration and see what happens.
    [1] Adding automatic routing policy evaluation into our CI/CD pipelines that looks specifically for empty or erroneous policy terms
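The "remove one condition" effect and the empty-term check from [1] can be illustrated with a small sketch. Everything here is hypothetical: policies are modelled as Python predicates and dicts, not parsed from a real router configuration.

```python
def matches(route, conditions):
    """AND semantics: every condition must hold. all() of nothing is True."""
    return all(cond(route) for cond in conditions)


def empty_terms(policy):
    """Names of terms that still have actions but no match conditions."""
    return [name for name, term in policy.items() if term["then"] and not term["from"]]


BOGOTA = {"2001:db8:b0::/48"}  # hypothetical prefix list


def is_internal(route):
    return route["internal"]


def in_bogota(route):
    return route["prefix"] in BOGOTA


routes = [
    {"prefix": "2001:db8:b0::/48", "internal": True},
    {"prefix": "2001:db8:aa::/48", "internal": True},
    {"prefix": "2001:db8:cc::/48", "internal": False},
]

matched_before = [r["prefix"] for r in routes if matches(r, [is_internal, in_bogota])]
matched_after = [r["prefix"] for r in routes if matches(r, [is_internal])]
matched_empty = [r["prefix"] for r in routes if matches(r, [])]

# Placeholder condition and action strings, not real router syntax.
flagged = empty_terms({
    "TERM-A": {"from": ["is-internal"], "then": ["advertise"]},
    "TERM-B": {"from": [], "then": ["advertise"]},
})
```

Each removed condition can only keep the matched set the same or grow it, and a term with no conditions left matches everything, which is why a lint for empty terms is a cheap first safety net.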
    vlovich123 15 days ago
    Yeah, but all of those basically boil down to “the next outage will look different from before” which is fine but isn’t an actual solution IMO.
    My point is you want to do that and gradual rollouts that you don’t make permanent until you’ve observed the real world behavior if you want to prevent all future outages. This specific temporary rollout and automatic rollback also has the side effect that even if you don’t do any of the “hardening steps” outlined, your system will still prevent any kind of mistake you’ve made from rolling out and becoming more permanent. Like I said, the “flapping” parameters can be tuned however you want and you can aggregate updates into an automated “release train”. If you want you can do so with automated health metrics, although it can be hard to automate validating that behavior before and after the route change is “correct” (maybe trained ML models would be helpful here).
    This is btw in many ways how Google releases code into production - they bundle a bunch of PRs into a giant “publish” step - if CI fails or anything in production fails, they automatically rollback the entire set of changes since they can’t know which part of the release went bad. It’s a huge hammer to solve any issue they didn’t account for.
    arter45 15 days ago
    Rollback is of course useful when things go wrong (and by the way the routers CF uses natively support rollback features). What I’m questioning is flapping as a structured way to carry out network changes.
    Even a slow flap can cause issues downstream. Imagine a router handling hundreds of thousands of routes. Its software has a memory leak so any route received increases its RAM usage. A slow flap may well bring that router to a halt. Now you might say, “hey, this is not my fault”, but it is still something that could happen to your routers or your peers.
    Another aspect is that network devices can get Terabits/s of traffic. Now, a router is mostly stateless, but if you do this flapping thing to a firewall, what you get is a lot of sessions with behavior1 and then switching to behavior2 and so on, which can cause high buffer utilization or packet drops.
    So, yes, of course you “flap” (rollback) when things go wrong, but you probably don’t do it intentionally to test what’s going on in a network change.
    vlovich123 15 days ago
    > Its software has a memory leak so any route received increases its RAM usage.
    Surely you realize this is a weak reason: is the argument against really that someone else’s misbehaving software is my problem? I mean anyone sane in networking would treat this as not their problem (or at least work with the major providers for whom it is to make this possible).
    However the strongest reason why I don’t buy this is that routes change regularly as a matter of course so changing a route forward and back is no different from changing it twice and so this bug would already be causing you issues and this is maybe a small percentage of extra advertisements.
    > what you get is a lot of sessions with behavior1 and then switching to behavior2 and so on, which can cause high buffer utilization or packet drops.
    Again, this explanation largely relies on FUD rather than concrete explanations. BGP routes change regularly and often. Such issues if they exist are already problems and briefly advertising a new route for a period of time as a dry run doesn’t alter those issues in any meaningful way. The problem is you’re treating “flap” as somehow magically different from any normal route change when it’s not really meaningfully so.
    arter45 15 days ago
    In the session scenario, I was talking about firewalls, not BGP routers (although, of course, you could have firewall features on a BGP router).
    What I'm saying is, there are ways to validate and carry out network changes in a pretty robust way, including gradual rollout (if that's what you want) by using route or firewall rules priority or other mechanisms.
    I keep being skeptical about this flapping strategy, but if this works in your setup, good for you.
- eqvinox 15 days ago
  > It gives you a cheap way to preflight what will happen when you make a globally impacting config change.
  Your "1-minute flap" can propagate and trigger load on every single DFZ BGP router on the planet. That's not cheap.
  And 1 minute is too short to even propagate across carriers. There are all kinds of timers working to reduce the load from the previous point; your update can still be propagating half an hour later. It can also change state for when you do it for real. And worst of all, BGP routes can get stuck. It's rare, but a real problem.
  - vlovich123 15 days ago
    Ok. 5 minutes. The point is clearly there’s route changes happening globally already. It should not be that much extra work to add like 10% more route changes (again - you’d batch the new route advertisements in one cohort rather than updating each individual route back and forth).
    And stuck routes are a problem but not one this would make worse since those routes would get stuck from normal changes anyway.
    The propagation problem isn’t real because clearly most route advertisements that handle most of the traffic actually happen quickly. You shouldn’t care about the long tail - you want to minimize the risk of your new route. The old route being present isn’t a problem and the new route disappearing back to the old also shouldn’t be a problem UNLESS the new route was buggy in which case you wanted to rollback anyway.
    TLDR: these don’t feel like risks unique to advertising and then undoing it given the route publishing already has to be handled anyway AND cloudflare is a major Tier 1 ISP and handles a good chunk of the entire internet’s traffic. This isn’t about a strategy for some random tier 2/3 ISP.
    eqvinox 14 days ago
    > This isn’t about a strategy for some random tier 2/3 ISP.
    That's not a constraint you mentioned in your original post.
    > Ok. 5 minutes. The point is clearly there’s route changes happening globally already. It should not be that much extra work to add like 10% more route changes […]
    I see you haven't had to deal with the operational reality of devices handling things they weren't quite designed for, and/or have been overdue for replacement, and/or were just designed to the limit to begin with. Good for you. But your solution would affect the entire internet.
    If you're serious, you could try posting your suggestion to the NANOG or RIPE mailing lists. At the very least you'll probably learn a whole new set of expletives and curses… but I'd recommend against it.
- PunchyHamster 15 days ago
  nice way to 100% the router CPUs for all your peers
ifwinterco 15 days ago
Cloudflare were way too smug for years about how perfect they were, a string of issues was inevitable. Pride comes before a fall.
betaby 15 days ago
Weak engineering. Both from the Cloudflare side and their peers.
freakynit 15 days ago
It's almost always either the configuration change or the DNS lookup.
- arter45 15 days ago
  Or the DNS configuration change :)
  - freakynit 11 days ago
    Lol yeah haha
parhamn 15 days ago
Their status pages were all green when we dealt with this.
vvilliamperez 15 days ago
I initially misread that as "Routine incident"
tomofmanhattan 15 days ago
With 365 Data Center we were down for *eight (8)* Hours. Thanks a lot Cloudflare!!
arter45 15 days ago
I've had to read the RCA a couple of times to (probably) get what happened, even if I'm reasonably familiar with BGP.
Basically, my understanding (simplified) is:
- they originally had a Miami router advertise Bogota prefixes (=subnets) to Cloudflare's peers. Essentially, Miami was handling Bogota's subnets. This is not an issue.
- because you don't normally advertise arbitrary prefixes via BGP, policies were used. These policies are essentially if/then statements, carrying out certain actions (advertise or not, add some tags or remove them,...) if some conditions are matched. This is completely normal.
- Juniper router configuration for this kind of policy is (simplifying):
set <BGP POLICY NAME> from <CONDITION1>
set <BGP POLICY NAME> from <CONDITION2>
set <BGP POLICY NAME> then <ACTION1>
set <BGP POLICY NAME> then <ACTION2>
...
- prior to the incident, CF changed its network so that Miami didn't have to handle Bogota subnets (maybe Bogota does it on its own, maybe there's another router somewhere else)
- the change aimed at removing the configurations on Miami which were advertising Bogota subnets
- the change implementation essentially removed all lines from all policies containing "from IP in the list of Bogota prefixes". This is somewhat reasonable, because you could have the same policy handling both Bogota and, say, Quito prefixes, so you just want to remove the Bogota part.
HOWEVER, there was at least one policy like this:
(Before)
set <BGP POLICY NAME> from is_internal(prefix) == True
set <BGP POLICY NAME> from prefix in bogota_prefix_list
set <BGP POLICY NAME> then advertise
(After)
set <BGP POLICY NAME> from is_internal(prefix) == True
set <BGP POLICY NAME> then advertise
Which basically means: if you have an internal prefix, advertise it
- an "internal prefix" is any prefix that was not received from another BGP entity (autonomous system)
- BGP routers in Cloudflare exchange routes to one another. This is again pretty normal.
- As a result of this change, all routes received by Miami through some other Cloudflare router were readvertised by Miami
- the result is CF telling the Internet (more accurately, its peers) "hey, you know that subnet? Go ask my Miami router!"
- obviously, this increases bandwidth utilization and latency for traffic crossing the Miami router.
- erredois 15 days ago
  I am not very familiar with Juniper config, but this phrase summarizes it well: "This means we (AS13335) took the prefix received from Meta (AS32934), our peer, and then advertised it toward Lumen (AS3356), one of our upstream transit providers." Basically, you should not receive a prefix from an eBGP session (a different AS) and advertise it to another eBGP session. As they mention in the next steps, good use of communities could help avoid it in case of other misconfigurations.
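The rule behind the quoted sentence is usually stated a bit more narrowly, as the "valley-free" export rule: routes learned from a customer go to everyone, while routes learned from a peer or a provider go only to customers. A small sketch, with hypothetical relationship labels and function name:

```python
def may_export(learned_from, send_to):
    """Valley-free export rule.

    learned_from: 'self', 'customer', 'peer', or 'provider'
    send_to:      'customer', 'peer', or 'provider'
    """
    if learned_from in ("self", "customer"):
        return True  # own and customer routes are announced to everyone
    return send_to == "customer"  # peer/provider routes only go downstream


# The case quoted above: learned from a peer, sent to a transit provider.
leak = may_export("peer", "provider")
normal = may_export("customer", "provider")
```

In practice this rule is typically enforced by tagging routes with a community on import (recording where they were learned) and matching on that community on export, which is the kind of safeguard the comment refers to.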
  - arter45 15 days ago
    Yes of course, but from a test perspective, their configuration snippet and the way they wrote the RCA seem to suggest they were simply diffing the initial and desired configs as any VCS would do (or, more likely, using a Juniper “show|compare” command).
    This didn’t catch the fact that removing that line essentially removed all conditions, allowing received routes to be re-advertised by the Miami router.
    Communities are useful in this case, but this kind of thing could have happened with any kind of configuration.
    Example:
    (Before)
    set firewall family inet filter FILTER NAME term TERM1 from source-address 10.10.10.1
    set firewall family inet filter FILTER NAME term TERM1 from destination-port ssh
    set firewall family inet filter FILTER NAME term TERM1 then discard
    What happens when you remove references to 10.10.10.1, maybe because that IP is not blacklisted anymore? You’re simply removing one condition, leaving all ssh traffic to be discarded. That’s essentially what happened with the BGP outage, only here you have no BGP communities to save you.
    That’s why I re-read the RCA, because this kind of incident is way more general than BGP-specific misconfigurations.
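A check that goes one step beyond a textual diff could group the `set` lines by term and flag any term that lost a `from` condition while keeping its `then` action. The sketch below does this for the firewall example above. The regex and function names are hypothetical and only handle this simplified `set ... from/then ...` shape, not real Junos syntax in general.

```python
import re
from collections import defaultdict

# Split "set <scope> from|then <rest>" at the first from/then keyword.
LINE = re.compile(r"^set (?P<scope>.+?) (?P<kind>from|then) (?P<rest>.+)$")


def parse(config):
    """Group match conditions and actions by their term (the scope)."""
    terms = defaultdict(lambda: {"from": set(), "then": set()})
    for line in config.strip().splitlines():
        m = LINE.match(line.strip())
        if m:
            terms[m["scope"]][m["kind"]].add(m["rest"])
    return terms


def broadened_terms(before, after):
    """Terms that kept an action but now have strictly fewer conditions."""
    old, new = parse(before), parse(after)
    return sorted(
        scope for scope, term in new.items()
        if term["then"] and scope in old and term["from"] < old[scope]["from"]
    )


BEFORE = """
set firewall family inet filter FILTER NAME term TERM1 from source-address 10.10.10.1
set firewall family inet filter FILTER NAME term TERM1 from destination-port ssh
set firewall family inet filter FILTER NAME term TERM1 then discard
"""

AFTER = """
set firewall family inet filter FILTER NAME term TERM1 from destination-port ssh
set firewall family inet filter FILTER NAME term TERM1 then discard
"""

warnings = broadened_terms(BEFORE, AFTER)
```

A plain diff shows one deleted line. This check instead reports that `TERM1` now matches more traffic than before, which is the question a reviewer actually needs answered.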
reader9274 15 days ago
> and only affected IPv6 traffic
Why even bother to write an article about it then haha