Railway Is Having a Major Outage (status.railway.com)

51 points by kgraves 2 hours ago | 14 comments

fjni 19 minutes ago
Wait… railway runs on GCP? Didn’t they make a whole thing about not “building a cloud on top of another cloud?”
Or did they just mean that they’re not renting VPSs but only metal from the cloud provider?
In my mind I was so excited that there was another provider not just paying one of the hyperscalers but at a minimum colocating and owning more of their stack. https://blog.railway.com/p/heroku-walked-railway-run
- miniman1337 9 minutes ago
  from the blog linked via Wayback Machine. "From Day 1, we had this notion at the forefront.
  The other notion that we have intuited is that you can’t build a cloud on another cloud. We have devoted years of practice running our own metal (and playing well with other clouds) to make sure that Railway’s business, which invariably becomes your customer’s business, is as rock solid as possible."
- eoswald 16 minutes ago
  Yep, and this is why I'm pissed. They lied. They're completely dependent on GCP. So I gotta do some research; I need something a little more stable (and less dependent on one company's whims) than this. This is bad for them, because it really strikes at the heart of their 'big claim': peaceful software deployments. This is chaos.
  - ndneighbor 5 minutes ago
    Yea, I mean, that's the whole MO of our platform and we failed at that. So yea, that's disappointing and more so for our customers.
    I can provide an explanation about the GCP dependency. Yes, we have host workloads off GCP, and we have been able to build a good business by performing a cloud exit. However, we were worried that we would have a circular dependency on our own cloud. I don't think we expected to get auto-modded out of our own account, hence we left our DB on CloudSQL.
    It was never our intent to deceive people that we didn't own our own destiny with our business. The last GCP issue, we were assured that this scenario wouldn't happen (when we got auto-ratelimited, which was bad, but survivable) - but it seems like we have further work to do. Apologies.
eoswald 44 minutes ago
Sorry, I have a hard time blaming Google for this, when Railway seems to be having increasing trouble keeping the platform stable. Something like this should NOT take down an ENTIRE service. There should be a backup when literally your business is about being the reliable backend. This just seems like poor planning to me.
- ryanisnan 30 minutes ago
  I don't quite know what you mean. Do you really expect Railway to use a multi-cloud architecture to host all of their clients' projects? I suspect that would lead to lower availability, all things considered.
  - eoswald 18 minutes ago
    Well, by the same token, is it smart to base your ENTIRE architecture on a single cloud? Isn't that why some of us build in fallbacks for AWS-hosted services? I mean, their entire platform, both public and private facing, is running on the same thing. One error, one problem, takes out the entire service.
  - impulser_ 21 minutes ago
    They literally own their own data centers. That's what's surprising about this. They are lying to their customers when they say they operate their own data center, because obviously they don't if everyone's apps are down with GCP blocking their account.
    - ryanisnan 19 minutes ago
      Oh, I see what you mean. Eh, it's possibly the same reason that AWS essentially goes down when us-east-1 goes down.
- cactusplant7374 35 minutes ago
  Disaster recovery is pretty expensive, right? Especially for their size.
Avicebron 16 minutes ago
Isn't Railway the "the API key to delete the backups is in the prod database, because that's where the backups live duh" guys?
enahs-sf 21 minutes ago
I respect what railway is doing but also would never run my business on such a platform.
- eoswald 14 minutes ago
  Today changed my opinion on them completely. Was willing to give them the benefit of the doubt that they're growing fast, but now seeing that they've failed to scale properly, and are missing little things that become big things later. I can't take that risk.
- dpark 16 minutes ago
  That kind of sounds like you don’t respect what they are doing.
Mengkudulangsat 30 minutes ago
That explains why all my vibe-coded hobby projects are down.
Thank God I'm not dealing with any public-facing sites! Would have been an expensive lesson for a newbie coder if my job depended on this.
faangguyindia an hour ago
Google Cloud also locked out a Korean government organization recently. The guy posted on the GCP subreddit.
Google really need to improve their support team. It's strange such a big corp can't even afford to have proper support team.
- danpalmer 23 minutes ago
  > It's strange such a big corp can't even afford to have proper support team
  Railway say they are in touch with that support team.
- King-Aaron an hour ago
  > It's strange such a big corp can't even afford to have proper support team
  This seems to be by design.
  - ndneighbor 3 minutes ago
    We have a CSM, Head of Customer Support contact, and further contacts with GCP. Despite that, we still had this issue.
brokenodo 32 minutes ago
I’m a new customer and have been falling in love with Railway over the last 2 weeks, but this is quite the wake up call.
- csw-001 30 minutes ago
  Literally in the same boat. I've been really happy with it, but this is a major eye opener.... It's been down for a looooong time by provider standards.
  - reelvideocap 24 minutes ago
    same
throwaranay4933 2 hours ago
This screenshot from Discord suggests that the outage was caused by an automated GCP account ban: https://x.com/acgfbr/status/2056866780866351323
ryanisnan an hour ago
Yikes. I was wondering why my TLS certs were coming up as invalid.
bshack0 26 minutes ago
so....what are we switching to y'all? cloud-run ? ;P
- auxiliarymoose 17 minutes ago
  federated hardware (a bunch of raspberry pis networked into a high availability kubernetes cluster, hidden across various local coffee shops for free power and bandwidth)
- throwatdem1231113 minutes ago
  raspberry-pi cluster in my closet
mcontrerazCL an hour ago
all my fkn postgres dbs are on railway! what do i do now?
- eoswald 7 minutes ago
  Hahah at least you're not getting called every five minutes because you can't shut off the alerts, because it's apparently deployed SOMEWHERE but good luck finding how to access it. Can't wait to see the bill from Twilio because of this lol
- cactusplant7374 32 minutes ago
  Take a walk. Breathe in the fresh air. It feels good.
iloveplants 2 hours ago
seems like it's every day
rekabis 27 minutes ago
TL;DR: putting all your eggs into one basket is bad, man.
- lfx 19 minutes ago
  That’s true, however having only a few eggs and shopping for several baskets does not make sense in the early days. Not sure how big Railway is, but usually you start small with one egg.
  - christophilus 7 minutes ago
    You’d think they wouldn’t have started with GCP. There are plenty of datacenters where you can buy racks and racks of servers, and talk to a human when something goes wrong, and even walk in and access your servers. That’s what I’d be using if I were to build a Rackspace today.
    - tomschlick a minute ago
      They started on GCP and have been migrating to their own "Metal" DC doing exactly what you're describing. But GCP is still their overflow given how rapidly they are growing and holds some amount of networking that routes to their DC.
bshack0 26 minutes ago
so...what are we switching to yall? cloud-run :P