68 points by petethomas 6 hours ago | 19 comments
  • palmotea 5 hours ago
    > Amazon’s ecommerce business has summoned a large group of engineers to a meeting on Tuesday for a “deep dive” into a spate of outages, including incidents tied to the use of AI coding tools.

    > The online retail giant said there had been a “trend of incidents” in recent months, characterised by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.

    > Under “contributing factors” the note included “novel GenAI usage for which best practices and safeguards are not yet fully established”.

    > “Folks, as you likely know, the availability of the site and related infrastructure has not been good recently,” Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.

    • hansmayer 4 hours ago
      > “Folks, as you likely know, the availability of the site and related infrastructure has not been good recently,” Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.

      Also some SVP over there: "Folks, we'll measure your performance and bonus based on how much you use Gen AI :)"

      • rsynnott 2 hours ago
        Yeah, “you must use LLMs, but also pls don’t use them for important stuff” is a difficult circle to square.
        • Gud 2 hours ago
          Who said you can’t use it for important stuff? Just because SOME people are screwing up doesn’t mean everyone is.
          • hansmayer an hour ago
            Of course you can use them for whatever you want. It's also not disputable that some people will be more careful than others. The issue, however, is that the idiots who pushed for widespread usage of AI in companies, i.e. clueless MBAs, have also pushed it onto exactly the types you are mentioning - the ones who will screw things up because they are incompetent, or don't care, or most likely both. So it's not a criticism of people who are careful in their usage of LLMs in critical scenarios - it's a criticism of the morons who bought into the AI hype and really believe an LLM will produce, at 1% of the cost, terraform code as great as what was previously written by 10 engineers.
    • VirusNewbie 5 hours ago
      GenAI at fault, and nothing to do with amazon laying off 30k people and having an overall shitty culture where people mostly don’t want to stay?
      • applfanboysbgon 4 hours ago
        > GenAI at fault, and nothing to do with amazon laying off 30k people

        GenAI is literally the stated reason they gave for laying off 30k people.

        > “As we roll out more Generative AI and agents, it should change the way our work is done. We will need fewer people doing some of the jobs that are being done today, and more people doing other types of jobs,” [Amazon CEO Andy Jassy] bluntly admitted.

        • surgical_fire 36 minutes ago
          It's not; in the latest round of 14k layoffs, they were more transparent that it was a result of having previously overhired.
        • spwa4 an hour ago
          There is a long history of people blaming AI for things in ways that are totally unfair, and I do believe quite a lot of probably somewhat older ML practitioners are seriously tired of that constantly happening. Amazon is prioritizing investment into data center expansion over paying employees. And ML ... is present in the building, and about as involved in the firings as the cleaning staff is; only people are scared of AI, so it gets blamed for everything. The firings are driven by imho misguided financial engineering, and it sure as hell is not being done by ML.

          But what is reported? Management firing people? ML. Engineering screwing up the uptime? ML. Someone screws up their job? ML.

          Don't you know? ML is killing people in Iran today. Not mullahs. Not the military. ML. Obviously that's where the responsibility lies ...

          Usually blaming ML means suddenly coming up with conspiracy theories, like here, or with impossible suddenly-added requirements, usually utterly ridiculous ones (like criticizing Deep Blue for not being able to play poker; yes, I realize I'm old, but it's a bit like criticizing the very best competition canoe on the planet for its disappointing spaceflight capabilities).

          Like here: large blast radius AI-assisted outages ... we've all written software, and we all know the problem here: THEIR TESTS SUCK. Probably because they fired all the good SREs for insisting software teams spend time on tests, or demanding integration test failures are fixed before shipping the software.
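          The "fix integration test failures before shipping" discipline mentioned above can be sketched as a tiny gate. This is a minimal illustration, not Amazon's actual pipeline; the test and deploy steps are injected stand-ins:

```python
# Hypothetical deploy gate: a release ships only when integration tests pass.
# run_tests and deploy are injected callables so the policy is easy to verify.
def gate_deploy(run_tests, deploy):
    """Return True only if tests passed and the deploy step ran."""
    if run_tests() != 0:  # nonzero exit status means failing tests
        print("integration tests failing; blocking deploy")
        return False
    deploy()
    return True

shipped = []
assert gate_deploy(lambda: 1, lambda: shipped.append("v2")) is False
assert shipped == []  # failing tests never reach production
assert gate_deploy(lambda: 0, lambda: shipped.append("v2")) is True
assert shipped == ["v2"]
```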

          By the way: I'd like to point out that in most/all industries where jobs are lost on a large scale, the situation is like the Amazon situation: ML is not even remotely involved. So while I get the criticism, it doesn't work like that. The auto industry first got blasted with very traditional engineering, which worked and depended on very old-style mathematics. What's happening in factory automation is 99.9% 3D geometry (to the point that ML is actually a simplification of the problem). Then the auto industry got blasted with what every industry got blasted with: being stuck in demand-limited markets. Every car company can easily build 10x more cars next year, but there's no point: nobody will buy them. So the only thing worth doing for these companies is to produce more cheaply ... and that means getting rid of people (when end-to-end taxes on income in Europe are 60-85% and actually rising). With only a few exceptions, these companies find ML too expensive for projects.

          So while I understand "we're defending our jobs", it's misguided ... the big job losses in the west have nothing to do with ML. MAYBE those are coming, but large job losses have been predicted in the last 50 AI "revolutions", and 49 times that was wrong. And the actual problem is really a return to 99.9% of history: when it comes to doing what is needed to keep society going, 10%, maybe even 1%, of people can do it. That means you need something for the other 90% or 99% to do.

          The solution is the only thing that has helped in the past: having the government put on huge public works. From building the pyramids to the Sagrada Familia (and yes, wars. But let's please not do that), or ridiculous engineering projects like Europe and America's rail networks. There's a stable in the Italian alps that has a private rail connection. So fix the problem. I don't know: build a large cathedral in Washington or something. Hell, hire people to make sure it has a depiction of the last supper where every square micrometer of the painting was designed by an AI with 1000-member engineering team, so people can spend their entire life looking at the painting with a microscope and find new details every day. Let's do something "great", in the sense of an enormous effort. Fly 100 missions to Alpha Centauri. Fix the demand-limited issue the economy has. "Do more with more". And stop blaming ML. Hell, I'm currently in an old European city filled with 200-year old buildings. Quaint. Cool. Except ... not really. 90% of these buildings suck. Can we just rebuild 95% of ... all European capitals? Every building that is way too old and has no reason whatsoever to be preserved other than it's currently slightly cheaper ... can we please just rebuild them better? Do stuff like that.

      • nixass 4 hours ago
        Absolutely correct. Now let's drop another few billion to make AI better and avoid such mistakes in the future. And we might lay off some more folks to make room in the budget for more AI.
      • jiggawatts 5 hours ago
        Also, managers are incentivised to force AI onto the remaining staff to “boost productivity” but of course they won’t accept any of the responsibility or blame for that decision.
        • zihotki 4 hours ago
          Just tell the employees to make AI fully adopted in SDLC and make it secure and reliable. Don't make mistakes.

          If it works for models, why not humans? /s

      • aerhardt 5 hours ago
        Maybe both, and possibly other causes too, but allow us a moment to revel in the schadenfreude of AI code slop at hyperscale, will you?
  • jqpabc123 5 hours ago
    Summary: AWS has volunteered to serve as a crash test dummy for vibe coding.

    But don't tell anyone --- and if you do, don't blame AI, because it's all the humans' fault for not shaping their questions in the "right way".

    • arjie 4 hours ago
      For this particular experiment, regardless of phrasing, I think the guys with the most appetite for risk have to be Cloudflare. They're shipping at an astonishing pace, but I think there have been far more outages than there were in the jgc era. Perhaps Anthropic's application-side teams are faster and more cowboy[0], but they are super AI-native, so that makes sense.

      0: I think this is an era cowboys win in, so they're (unsurprisingly) smart to do this

      • Rohunyyy 4 hours ago
        I am surprised we haven't had an actual Y2K-style crash with all this AI code. Like, how do you review a 1000-line Claude-generated PR?
        • krilcebre 3 hours ago
          You don't. I can guarantee that 90% of the generated code will never receive a detailed review, simply because there's too much cognitive overhead and too little time; everything moves too fast.

          I remember having to do such a code review, before AI, in a highly complex component, and it would take a full day of work. In this day and age, most of the people I know take like half an hour and mostly scan for obvious mistakes, whereas the bigger problem is the sneaky non-obvious ones.

          • kakacik 3 hours ago
            Exactly. It's the same as reviewing somebody else's code. How many companies did this properly before LLMs came along? I know mine didn't. But these days, people who aren't senior enough do reviews of LLM output, take a quick mental pass through the code, see that it works, and approve it.

            What could work: the LLM creating a very good test suite for its own code changes and the overall app (as much as is feasible), with those tests getting a hardcore review. Then the actual code review doesn't have to be that deep. But if everybody is shipping like there's no tomorrow, edge cases will start biting hard and often.
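            A minimal sketch of the kind of test worth that hardcore review: one that pins down invariants instead of mirroring the implementation. apply_discount here is a made-up stand-in for LLM-generated code under review:

```python
# Toy stand-in for an LLM-generated change: discount a price in cents.
def apply_discount(price_cents: int, percent: int) -> int:
    """Discount a price by a whole percentage, never going below zero."""
    return max(0, price_cents - price_cents * percent // 100)

# The test states invariants a reviewer can sanity-check in seconds,
# instead of re-deriving the arithmetic line by line.
def test_discount_invariants():
    for price in (0, 1, 999, 10_000):
        for pct in (0, 10, 50, 100):
            out = apply_discount(price, pct)
            assert 0 <= out <= price  # never negative, never a markup
    assert apply_discount(1000, 0) == 1000   # no discount changes nothing
    assert apply_discount(1000, 100) == 0    # full discount means free

test_discount_invariants()
```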

    • bootsmann 4 hours ago
      This wouldn't happen if they used my CLAUDE.md of course!
    • blitzar 3 hours ago
      They were holding it wrong.
  • nwmcsween 3 hours ago
    > Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.

    Beatings will continue until senior engineers leave?

    • rsynnott 2 hours ago
      I wonder what senior means here. Like, unless it’s fairly junior seniors, the ratios are going to make that impossible.
  • alecco 34 minutes ago
    They are trying to cure the symptom. The actual cause: one of the most toxic environments to work in as a developer.

    Another case of AI as a scapegoat.

  • rhubarbtree 4 hours ago
    Some engineers will point to this and say, hey, AI is not gonna work. It doesn’t reason very well and it leads to these problems.

    But what they’re missing is all code quality is going to tank, and we are just going to accept that. Just as artisanal goods were replaced in the Industrial Revolution with mass produced inferior ones.

    People will accept bad code if it is cheap enough.

    We’ve gotten used to aiming for great, even if we often only hit functional. The new bar is going to be so much lower. Welcome to the era of cheap bad code. Lots more software, lots more value overall, but much worse reliability. Every day the apps I use get buggier.

    • rendaw 3 hours ago
      I thought this too, but it's still weird.

      Machines that make e.g. paper are great. They are immensely more efficient, and also extremely consistent and superhuman (try making that perfectly smooth letter paper by hand).

      Human written software is the same. Where you had N people copying data from spreadsheets for M suppliers into an internal database or whatever, you now have one program doing it. It can be scaled infinitely for a fraction of the cost. It _never_ messes up. The cost of the software developer is trivial in comparison. Software was a space where the marginal cost for quality was extremely cheap.
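      That replacement is almost trivially small in code. A sketch of the idea, with made-up table and column names, using only the Python standard library:

```python
# One small program standing in for N people doing data entry:
# copy supplier rows from a CSV export into a database, deterministically.
import csv
import io
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE suppliers (name TEXT, price_cents INTEGER)")

csv_export = io.StringIO("name,price_cents\nAcme,1200\nGlobex,950\n")
rows = [(r["name"], int(r["price_cents"])) for r in csv.DictReader(csv_export)]
db.executemany("INSERT INTO suppliers VALUES (?, ?)", rows)

# The same loop handles 2 rows or 2 million, identically, every run.
count = db.execute("SELECT COUNT(*) FROM suppliers").fetchone()[0]
print(count)  # → 2
```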

      I don't get how AI fits in here. Software already had massive scale. You aren't replacing a massive data entry team with AI; you're replacing a reliable piece of software written by a human with a reliable (?) piece of software written by AI controlled by a human. There's no increase in scale, and until the reliability issues are fixed, a very noticeable decrease in reliability (sure, some software was bad already, but now the good developers are also writing bad code).

      This doesn't seem like a natural step to me at all. The best explanation I can come up with is AI is just being used as an excuse for destructive penny pinching.

    • ozgrakkurt 3 hours ago
      You are comparing code to a t-shirt, but it is more similar to infrastructure like roads/bridges/buildings. It is like a platform that you build other stuff on top of.
    • rsynnott 2 hours ago
      I don’t totally buy this. If you’re Amazon, there’s only so buggy you can get before you start losing huge amounts of money.
    • idiocratic 3 hours ago
      The economics of software are very different from physical goods. Margins on software (products) are orders of magnitude higher. Any cost shaving done at coding time is economically irrelevant in the long run, detrimental to quality/reputation and could almost be seen as a risk. Furthermore, assuming the bottleneck in this process has so far been coding is pure BS.
      • Ravus 3 hours ago
        > assuming the bottleneck in this process has so far been coding is pure BS.

        This is the core insight for most businesses.

        When evaluating the impact of AI on velocity, the first thing to consider is how long it takes for a one-line code change to get into production, including initial analysis and specs.

        You can't get faster than this.

      • rhubarbtree 3 hours ago
        The cope island of objections will continue to shrink.

        Being able to easily create apps means huge supply, which means commodification of software, just like the commodification of physical goods. Mass supply means low prices. It won't be economic to have artisan coders any more than to have artisan goods makers.

        • yladiz 3 hours ago
          And yet people still want artisan goods, artwork, high end food, things that aren’t “economic”.
    • gtsop 4 hours ago
      You are almost right. As I've been saying since the beginning of this AI circus, this is the equivalent of flipping McDonald's burgers (no insult intended for those workers). It is a thing, and people buy and eat them. But high-quality burgers made by talented chefs will always be out there. That's my analogy, and I don't intend to be on the side flipping McDonald's burgers.
      • rhubarbtree 4 hours ago
        There are a lot of McDonalds and very few Michelin starred restaurants.

        Safety critical engineering and infrastructure layers will (eventually again) be rigorous. Everything else is headed to slop.

        My craft died. I’m sad. Time to move on.

        • kakacik 2 hours ago
          Where I live (Geneva, Switzerland), gourmet high-quality burger joints definitely and massively outnumber McDonald's, even if I count in Burger King. Shows that sometimes people pay for quality even if they don't desperately need it. And it's trivial to make better burgers than McD; heck, I can surpass them trivially at home on every ingredient. They are really the lowest level of quality, taste, looks, or (lack of) healthy components. You don't need a Michelin star for that, far from it. Plus the food is often cold outside of peak hours, something that has never happened to me in a proper restaurant.

          Also, McD isn't in the end much cheaper, just marginally, and the choice of drinks is pathetic, usually no beer. The main reason folks go there is that it's easier/faster than getting a table in a real restaurant. But the environment in McD is also absolutely soulless, cheap, fugly shit. (There are kids' corners, to be fair, but they are often disgustingly dirty.)

          It's a very good analogy in the end, IMHO; maybe just not tilting the way you intended, at least not here.

      • rsynnott 2 hours ago
        It's really not. McDonald's whole thing is consistency. It's never going to be good, but nor is it going to be that terrible.

        That is, ah, very much not the case for AI slop.

      • nottorp 4 hours ago
        > high quality burgers

        There is also, you know, actual food. Done by real chefs.

  • bravetraveler 3 hours ago
    When you hear "left behind", remember: is 'it' going to places you want?
    • MOSI3 3 hours ago
      And if it's going to get easier and easier for my work to be performed by AI, then what does it mean for me to "keep up"? Do I just need to create more slop than anyone else?
      • bravetraveler 3 hours ago
        Excellent consideration, probably. Sounds like a lot to do for very little in return. I'll leave with this, a sort of sick joke given context:

            Quit when the work is done
        • MOSI3 2 hours ago
          Fortunately my job is not based on generating plausible-sounding bs, so I should be safe.
          • bravetraveler 2 hours ago
            Hear, hear. Wish you the best! There's a whole list of other silly games to avoid, unfortunately. Namely, "up or out".

            For instance, I want to engineer [more], but I keep drifting closer to management or sales due to scope creep. At this rate, by career-end, I'll be operating a small country by myself.

  • mediumsmart 5 hours ago
    Is it only 45 dollars for the subscription? Does that cover the AI-related outages too, or just the engineering meeting?
  • pinkmuffinere 3 hours ago
    > The group has disputed the claim that headcount cuts were responsible for an increase in recent outages.

    It's a bit hard to believe this.

  • jcgrillo 4 hours ago
    > Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.

    Lol. Lmao. You have got to be joking. Seniors leaving in droves is how that plays out.

    • onei 3 hours ago
      I read that line and thought "so, the solution is code review?". What has to happen to your processes that code review is not only missing, but unironically claimed to be the solution?

      I know there are some companies that never did code review, but this is Amazon. They should know better.

      • rendaw 3 hours ago
        It's _more_ code review. They already had senior code review.
    • wrxd 3 hours ago
      This is going to end either with seniors rubber-stamping absolutely everything without even reading it, or with seniors blocking most of the slop for no overall productivity gain.
      • Ekaros 3 hours ago
        Or, if review is actually done, I think there will be a productivity loss. Juniors with the help of AI can generate more code than seniors have time to review in a full working day, so the seniors won't have time left for any other work...
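        A back-of-envelope version of that mismatch; every number here is invented for illustration, not taken from Amazon:

```python
# Invented numbers: how review load can outgrow a senior's working day.
juniors = 6
loc_per_junior_per_day = 800          # AI-assisted output per junior
review_rate_loc_per_hour = 200        # careful line-by-line review speed

review_hours = juniors * loc_per_junior_per_day / review_rate_loc_per_hour
print(review_hours)  # → 24.0, i.e. three 8-hour days of review per calendar day
```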
    • Rohunyyy 4 hours ago
      Nope. NO ONE is quitting in the current market because they got asked to review extra PRs.
      • rsynnott 2 hours ago
        If you’re a senior at Amazon and your whole job becomes reviewing slop, well, you can likely get another job which does not revolve around reviewing slop. The current market is not great, but it’s disproportionately painful for juniors.
      • kakacik 2 hours ago
        Top people definitely do if they feel like it; why the heck shouldn't they? There is no shortage of work for them. But it's fine if a company, via its actions, shows it doesn't want to retain even its top talent. Just market forces and all that.
  • o10449366 5 hours ago
    Paywalled
    • techterrier 5 hours ago
      paste headline into google, click first link
      • kqr 5 hours ago
        Huh, it has to be Google, specifically, too! There used to be a shortcut for this action on HN (a link under the submission saying "web" or something?), but it seems that has been removed.
  • kerim-ca 5 hours ago
    Full Article

    Amazon’s ecommerce business has summoned a large group of engineers to a meeting on Tuesday for a “deep dive” into a spate of outages, including incidents tied to the use of AI coding tools.

    The online retail giant said there had been a “trend of incidents” in recent months, characterised by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.

    Under “contributing factors” the note included “novel GenAI usage for which best practices and safeguards are not yet fully established”.

    “Folks, as you likely know, the availability of the site and related infrastructure has not been good recently,” Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.

    The note ahead of Tuesday’s meeting did not specify which particular incidents the group planned to discuss.

    Amazon’s website and shopping app went down for nearly six hours this month in an incident the company said involved an erroneous “software code deployment”. The outage left customers unable to complete transactions or access functions such as checking account details and product prices.

    Treadwell, a former Microsoft engineering executive, told employees that Amazon would focus its weekly “This Week in Stores Tech” (TWiST) meeting on a “deep dive into some of the issues that got us here as well as some short immediate term initiatives” the group hopes will limit future outages.

    He asked staff to attend the meeting, which is normally optional.

    Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.

    Amazon said the review of website availability was “part of normal business” and it aims for continual improvement.

    “TWiST is our regular weekly operations meeting with a specific group of retail technology leaders and teams where we review operational performance across our store,” the company said.

    Separately, the company’s cloud computing arm — Amazon Web Services — has suffered at least two incidents linked to the use of AI coding assistants, which the company has been actively rolling out to its staff.

    AWS suffered a 13-hour interruption to a cost calculator used by customers in mid-December after engineers allowed the group’s Kiro AI coding tool to make certain changes, and the AI tool opted to “delete and recreate the environment”, the FT previously reported.

    Amazon previously said the incident in December was an “extremely limited event” affecting only a single service in parts of mainland China. Amazon added that the second incident did not have an impact on a “customer facing AWS service”.

    The FT previously reported multiple Amazon engineers said their business units had to deal with a higher number of “Sev2s” — incidents requiring a rapid response to avoid product outages — each day as a result of job cuts.

    Amazon has undertaken multiple rounds of lay-offs in recent years, most recently eliminating 16,000 corporate roles in January. The group has disputed the claim that headcount cuts were responsible for an increase in recent outages.

    • scuff3d 5 hours ago
      Gonna see a lot more of this in the coming years. The real cost of LLM tools has a delay. Devs don't tend to notice it until they're neck deep in code they don't understand, swearing the next prompt will get them out. CEOs won't notice until it starts costing them money, and that of course assumes anyone will be willing to admit it. A lot of people have their careers on the line, spending a metric shit ton of money on untested tools.
  • potetoooooo 5 hours ago
    nice domain
  • wiseowise 5 hours ago
    Hold a meeting?! No way! That's newsworthy material!

    Seriously, who even cares? It’s probably going to be “guys be careful but also continue to push slop kthx”.