Which certainly made me shit myself, briefly.
The day it broke away and became centralized was when we had a PR + mandatory "Required actions" to merge to main.
Gosh, it's hard figuring out what changes Lorne made if only we had a system to merge those changes. Enter git
Gosh it's hard figuring out what packages Rachel had to make this work. Enter rubygems/pip/npm
Gosh it's hard figuring out how to sync these changes across a network. Enter github
Gosh it's hard figuring out how to get those packages working on my operating system. Enter docker
Gosh centralizing our distributed version control software system onto one website is getting really unreliable. Enter fossil(?????)
If we go any further, having one computer per business with a sign-up sheet is starting to sound pretty fucking attractive.
Just set up a Kubernetes deployment and you’re set.
But as others mention, GitHub’s primary strength is collaboration. If you want decentralized, solve this by creating a decentralized collaboration tool on top of fossil and/or git.
For example, how to do pull requests and code reviews?
being a host for git repositories has never been its core competency. neither has its groupware offering.
does it even serve OSS well? a very interesting criterion is, "Have mature or adopted end-user-facing OSS recently merged a large PR from an unallied contributor?" The answer is overwhelmingly no. This is why there is so much innovation in this space.
Proudly self-hosting Forgejo since then.
> Our team is currently experiencing an unexpectedly high volume of tickets which has resulted in longer response times than we prefer. We acknowledge the long wait and apologize for the experience.
> Sometimes our abuse detecting systems highlight accounts that need to be manually reviewed. We've cleared the restrictions from your account…
Fully self-hosted IMO can be an overcorrection. The issue isn’t “relying on other people”—it’s relying on GitHub, when they’ve made it clear they don’t care about uptime and they don’t care about support turn-around-time.
It would be a pain as I'd have to set up a few integrations again, but github is far lower down the risk scale than the vast majority of SAAS providers
Is it true that official service status pages are updated automatically?
Depends. Typically no because there’s an art to crafting the actual message around impact… but sometimes yes it is automated
If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.
But effective monitoring is harder than people assume.
Who says public status page equals internal monitoring.
They likely know faster than you. Whether they post it publicly is a different issue (hint: SLA penalties, news impacting stock etc)
Are you sure you’re replying to the right comment?
Isn't that what monitoring actually is? The issue seems to be in their testing, not monitoring.
There are synthetic tests, where you can generate API request calls or even simulate an entire user journey. These allow you to control the user agent and the payloads, and thus you know that any errors that come back are actual errors. These are triggered by the observability platform (think of it like running a cron job) and thus you're not tied to user activity to see when problems arise.
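To make the synthetic-test idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `synthetic-probe/0.1` user agent, the `CheckResult` shape, and the function names are made up, and a real observability platform would handle scheduling, retries, and reporting for you.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]  # None when no HTTP response was received at all
    latency_s: float
    detail: str = ""


def http_get_status(url: str, timeout: float = 5.0) -> int:
    """GET the URL with a fixed, identifiable User-Agent and return the status code."""
    req = urllib.request.Request(url, headers={"User-Agent": "synthetic-probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions but still carry a status code.
        return exc.code


def run_synthetic_check(
    url: str,
    fetch: Callable[[str], int] = http_get_status,
    expected_status: int = 200,
) -> CheckResult:
    """Run one probe. Because we control the request, any mismatch is a real error."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as exc:  # DNS failure, connection refused, timeout, ...
        return CheckResult(url, False, None, time.monotonic() - start, repr(exc))
    return CheckResult(url, status == expected_status, status, time.monotonic() - start)
```

You would call `run_synthetic_check` on a schedule (cron, or whatever your platform provides) against an endpoint you own. The `fetch` parameter is injectable so the logic can be exercised without touching the network.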
There are other metrics outside of HTTP response codes too. Think of things like free RAM, CPU usage, disk space, etc. This is just naming some obvious ones because these types of metrics are generally bespoke to the type of application you're monitoring. And with these types of monitors, you'd not just have an alert when things have failed, but ideally have alerts when an irregular trend is showing that things are likely to fail too. This latter type of monitor helps you get ahead of the problem before it becomes customer facing.
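As a rough illustration of a trend alert (as opposed to an "it already failed" alert), here is a sketch that fits a straight line to recent free-disk samples and estimates how long until the disk is full. The linear fit, the function names, and the 6-hour horizon are all assumptions chosen for the example; real systems often need something smarter than a straight line.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, free bytes)


def seconds_until_exhausted(samples: Sequence[Sample]) -> Optional[float]:
    """Least-squares line through the samples; return the estimated seconds from
    the last sample until free space reaches zero, or None if it is not shrinking."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_v - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line crosses zero
    return max(0.0, t_zero - samples[-1][0])


def should_warn(samples: Sequence[Sample], horizon_s: float = 6 * 3600) -> bool:
    """Warn (low priority) if the trend says we run out within the horizon."""
    eta = seconds_until_exhausted(samples)
    return eta is not None and eta <= horizon_s
```

The point of the design is that the alert fires while the service is still healthy, which is what lets you get ahead of the problem.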
Then you have more traditional stuff like logs. This will also be bespoke to the application. But you'd expect errors in logs to get surfaced quickly. Assuming Github have good hygiene in what's being logged.
Tie that up with APMs, RUM, and other goodies like that and you'll have diagnostics to investigate issues when they appear.
(this is just a super high level view of observability too)
You should not alert on cpu, ram, etc
It doesn't "need" that. That's just how most people set it up because it’s an easy sane default that allows for network jitter without inexperienced engineers thinking about different conditions triggering different types of responses.
If you’re measuring internal APIs from an observability solution that has nodes already inside your network enclave, then there is a strong argument for alerting early.
> You should not alert on cpu, ram, etc
That’s not true to say as an absolute statement. As a generalisation, it heavily depends on the system you're monitoring and how it behaves under pressure.
But in any case, I wasn’t suggesting CPU alerts were the end goal. I said:
> these types of metrics are generally bespoke to the type of application your monitoring.
Ie you’ll use metrics but those metrics will be highly specific.
The CPU examples were an illustration as to what a “metric” is (it might seem obvious but not everyone is an expert) but the point was HTTP response codes aren't the only types of metrics one should be capturing and watching.
If your requests are fast and cheap, you can probe frequently relative to your goals, but often that's not really possible (think long SQL queries, or scheduling a container/pod). There you need several datapoints, or possibly fewer augmented with other signals.
Talking about long SQL queries, I quite like throwing CPU alerts on database servers. They'll be a low priority alert (ie no out of hours "pagers") so just something that goes into a slack channel. But they're a good indicator of when developers have poorly optimized SQL, or the DB schema is poorly defined (eg missing indexes), or the DB server itself is poorly sized.
This wouldn't be something you'd expect to need in production and definitely not something you'd rely on as a notice of a production outage. But it is an example of one of those 1% occasions where a CPU alert does add value to the overall observability of the application.
But this also ties into your excellent point about how you'd use CPU and other data points to build a picture of what's happening in your application.
idle CPU is often wasted CPU
Maybe the Github Actions infrastructure isn't run like that.
edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262
Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"
Which makes me think a small number of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.
It does require constant tuning and adjustment though.
This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because with ECC, bit flips, for all intents and purposes, cannot happen: the RAM itself detects and corrects them.
I wouldn't expect bit flips to be a significant contributor to enterprise problems.
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
Even if it's "DB in datacenter I tried to save to was hit by meteor" event, you can cater for this not to result in 500 (ie - DB unreachable, retry in a couple of minutes); the question is if you want to.
If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!
If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.
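One way to read "N consecutive checks over M minutes" is sketched below: page only when the failure streak is at least N checks long and has lasted at least M minutes, and reset on any success. The class name and fields are hypothetical, and this is just one interpretation of the rule, not how any particular paging product implements it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConsecutiveFailureRule:
    min_failures: int      # N: failures in a row before paging
    min_duration_s: float  # M minutes, expressed in seconds
    _streak: int = field(default=0, init=False)
    _streak_started: Optional[float] = field(default=None, init=False)

    def observe(self, timestamp: float, ok: bool) -> bool:
        """Record one health check result; return True if someone should be paged."""
        if ok:
            # Any success breaks the streak, so one-off errors never page.
            self._streak = 0
            self._streak_started = None
            return False
        if self._streak == 0:
            self._streak_started = timestamp
        self._streak += 1
        assert self._streak_started is not None
        return (
            self._streak >= self.min_failures
            and timestamp - self._streak_started >= self.min_duration_s
        )
```

With `ConsecutiveFailureRule(min_failures=3, min_duration_s=120)`, a single 500 followed by a success never pages, while three failures spread across two minutes do.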
Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).
Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.
Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.
I'm sure you're not in ops. Or in a dev org of a service with decent request rates.
What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have zero 500s. Not when it still sees development.
A 50 year old bank API? Maybe...
But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.
Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at so they feel like they are "doing something"? Or is it more a way to confirm what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you Tom? Oh wait, crud. The service page is confirming it's down now. Never mind! Who wants coffee?!"
https://www.reddit.com/r/GithubCopilot/comments/1toa9tf/mode...
I vibe coded a script that interacts with both Gitlab and Github via their APIs and I've been using it pretty heavily since this morning. I crossed the streams! Goodness, I didn't know it would be _this_ bad!
So why are Actions so unreliable anyway? Occam's Razor would probably suggest the domain is inherently complex/difficult; but other providers show that reliability is possible. What would Occam's Razor suggest next? Poor management..?
You’d need at least some hash of sources + test results, and check that it matches that (in CI).
And you’d still deal with environment differences.
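A minimal sketch of the "hash of sources + test results" idea, assuming a hypothetical workflow: after tests pass locally you write a stamp file containing a digest of the sources, commit it, and CI only recomputes the digest and compares. The `.tests-passed` file name, the `*.py` pattern, and the function names are all invented for illustration.

```python
import hashlib
from pathlib import Path
from typing import Sequence

STAMP_NAME = ".tests-passed"  # hypothetical stamp file committed alongside the code


def source_digest(root: Path, patterns: Sequence[str] = ("*.py",)) -> str:
    """SHA-256 over the relative path and contents of every matching file."""
    files = sorted({p for pat in patterns for p in root.rglob(pat) if p.is_file()})
    h = hashlib.sha256()
    for path in files:
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def record_passing_run(root: Path) -> None:
    """Call this locally, only after the test suite has passed."""
    (root / STAMP_NAME).write_text(source_digest(root) + "\n")


def stamp_matches(root: Path) -> bool:
    """Call this in CI: True only if the sources are byte-identical to the tested ones."""
    stamp = root / STAMP_NAME
    return stamp.is_file() and stamp.read_text().strip() == source_digest(root)
```

This only shows that the tested sources match the pushed sources. It says nothing about the environment the tests ran in, and nothing stops someone from writing the stamp by hand.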
Reasonable concern. In ~10 years of indie development, I haven't forgotten to run tests before pushing to main, ever. So setting up and maintaining complicated machinery to solve a problem that could happen (but never has) doesn't justify taking focus off other more important things, namely building.
The benefit probably increases with team size (I'm a team of 1, so I appreciate the luxury of being able to dodge CI/CD entirely).
Say a disaster happens and someone pushes to main without running tests, 9 times out of 10 it will be of ~zero consequence (either the code works first time, it was a cosmetic change that hardly affected users etc).
I know there are horror stories and CI/CD would have prevented some of those, but IME they're just not that common nor severe for small operations, and even when they happen, only a small subset are irreversible/unfixable.
The latest language models have enabled this sort of thing for me. I can integrate a mini Jenkins into every project within a 5-10 minute prompting session. This sort of code isn't hard. It's just tedious, and the LLMs absolutely rock at boring repetitive stuff. Having a win32 service start up successfully on the very first try is something I haven't experienced until 2026.
I agree in a hosted+shared SQL scenario you have to be a little bit more careful with all of this. Arguably, you should have a separate schema management phase in these cases.
But if you are just using SQLite embedded in the service, you can use the user_version pragma to track the schema version and perform deterministic migrations (assuming a user didn't manually jack with the file in-between).
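For anyone who hasn't used it, the user_version approach can look roughly like this with Python's built-in `sqlite3` module. The `notes` table and the two migrations are made up for the example; the only real SQLite feature relied on is `PRAGMA user_version`, an integer stored in the database header that defaults to 0.

```python
import sqlite3

# Hypothetical migrations: index i upgrades the schema from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)",
    "ALTER TABLE notes ADD COLUMN created_at TEXT",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations newer than the stored user_version; return the new version."""
    # Manage transactions explicitly so the DDL and the version bump commit together.
    # Note this changes the isolation level of the caller's connection.
    conn.isolation_level = None
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[version])
            # PRAGMA does not accept bound parameters; the value is a trusted int.
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Running `migrate` on service startup is idempotent: a fresh file goes from 0 to the latest version, and an up-to-date file is left alone.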
"Update something in the cloud" <- What do you mean?
That only works on extremely simple setups and has risks. If you have only a single server, you can stall it. Now, how to roll back?
Been burned too many times on that one.
Move to EC2.
Darn AWS is down.
Alright, run it on a Mac Mini in your basement. Ahh darn, your ISP is having issues. Good thing you have a backup 5G hotspot.
Ohh no, the power is out.
Eventually you have to trust someone else.
GitHub is a tragedy of the Commons. Too many people are using it, and Microsoft isn't willing to handle it correctly.
Feels like a very good business opportunity. Minimum 50k yearly contracts, GitHub with actual uptime. GitPro ?
Aggregate risk is too high.
This is supposed to be Hacker News! Who is coming up with a startup to fill the gap !
You should never entirely depend on a third party service to run your tests, either.
We can't be blocked here. It seems silly that we settled on this, but for many years GitHub had been reliable enough; things have been sliding down the pan as of late.
On my repo the jobs do not get scheduled on the PRs at all, so I assume that separation wouldn't help for today's issue.
Wait until they charge you for self-hosting runners.
Oh wait. They already tried.
You can now hire me as an overpriced consultant instead of paying Microsoft.
(Ofc, in a sensible universe, we just brush that off to a JS/Firefox glitch or my ISP.)
And yet, here I am. My code is not compiling, my AI isn't vibing, nonetheless I can't work! Two more hours before I can get off!
For instance, the UI at setups such as https://git.devuan.org/Daemonratte/gtk2-ng is quite ok-ish, in my opinion. Granted, it is mostly copy/paste from github but that still is about 1000000x better than sourceforge's interface - and gitlab's UI too (I just hate gitlab's UI, they seem to love complexity and a billion features only 0.000001% ever need; GitHub, with all its faults, is for the most part really simple - not everywhere, e. g. GitHub wiki setup sucks, but by and large I think it is simple overall).
Anyway. Forgejo's response to it: https://floss.social/@forgejo/116494295922963052
No, it's not like "act," because it uses the standard Github runner, the difference is that the control plane is an emulation of api.github.com, because of this we can do all kinds of nice things:
Caching in ~0 ms. Pause on failure, so you can let your AI agent fix it and retry without pushing.
Is what it boils down to.
> codex "Fix this pipeline, use `act` to verify your changes"
I have tried to use act many times, and many times I've failed.
P.S. pause on failure is also helpful for humans, but I'm trying to be realistic about where the future of programming is going...
I like that it exists, but what a freaking mess that it's necessary and so difficult to do.
I started playing with proxmox VMs and containers in them (docker and tart) to see if I can build some local infrastructure to properly solve this…
The jobs run in containers.
For Git, all you technically need is ssh access and some backup strategy for your server. It would be bare bones but workable. And there are of course plenty of OSS things that are a lot nicer than that.
I'm still using gh and gh actions and we are mostly below the freemium layer with that. But it is kind of slow and honestly a dedicated vm plus some high CPU/memory workers we can spin up on a need to have basis might be a lot faster. With GH outages becoming more common, my hand might be forced a bit.
In recent weeks, I've spun up listmonk (mailing list solution), matrix (as a slack alternative), and a few other things specific to our software stack. A github alternative would be more of the same. We don't need a lot.
The main objection is that with more moving parts to worry about, the workload for me also increases. Things need updating, monitoring, backups, alerting (and responding to alerts), etc. That sucks up my time and that is scarce.
Another reason for self hosting these days is that with agentic AI tools, self hosted things are a lot easier to integrate into agentic systems. If it is self hosted, you don't have to worry about API limitations, rate limitations, walled gardens, etc. All the traditional SAAS silos are becoming a problem from that point of view. The more locked down it is, the bigger the motive for moving away from it. That's why we ditched Slack for Matrix. Slack is hopelessly locked down and tedious to deal with. Matrix is super easy for this.
Technically Dropbox is just rsync.
Also https://xkcd.com/1319/ but for maintenance.
I don't think vibecoding at Github has much to do with it.
That makes sense. Thank you!
I don’t buy the excuse. I want to hitch my wagon to those “mysteriously lucky” competitors. (And have. And haven’t had similar issues to Github, since.)
Tough to say as this is all speculative, though.
Think critically.
agentic "ai" is going great
That being said, there was a noticeable trend starting around 2022.[2] They’ve also been doing a big migration to Azure. It’s likely a combination of things.
1: https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-a...
It is relatively easy to scale a collection of simple things to extreme and exhibit complex behavior together. It is a lot harder to scale something complex to extreme. But too many times the latter is the default - designed wrong from the ground up and stuck in scaling hell.
If Google owned GitHub would they be better positioned to scale?
I much prefer Woodpecker CI, which is an open source fork of Drone.io. It supports multiple Git backends like GitHub, Gitea, Forgejo, Gitlab, Bitbucket. It supports running jobs locally, on Docker, and on Kubernetes. And there's autoscalers built in for AWS, Hetzner, Linode, Vultr, and Scaleway. There's a bunch of 3rd party plugins (https://woodpecker-ci.org/plugins) for custom integrations. The UX is also very simple, with OAuth used not only for authentication/authorization but also setting up & accessing repos. The system architecture is great, with separate components that run stateless connected to a database, and a custom plugin is any program that takes environment variables and does stdio. The config file is a good balance of ugly YAML and convenience syntax like shell-style parameter expansion variables.
It probably takes less than 15 minutes to install, set up, and run WoodpeckerCI for a small team, so it's not a big investment to try out or host. With the autoscaling plugins it lets you scale your workload up to whatever size. Honestly you could run it on a laptop since it's written in Go.
(to clarify for beginners: the config file docs are found in a section called "workflow syntax" (https://woodpecker-ci.org/docs/usage/workflow-syntax) and variable parameter expansion is buried deep in an environment variables page called "string operations" (https://woodpecker-ci.org/docs/usage/environment#string-oper...). poorly organized docs aside, the system itself works well)
I used to use Cirrus CI as an alternative to GitHub Actions and am looking for a new alternative. I wonder if Depot could fit in the same way for my needs. I need to run builds and tests in Windows, Linux and macOS.
Hope you don't mind the public ask, it seems useful for others.
If we're using depot runners, and want to use them directly, or move off of github actions being the controller for when things run: what do you suggest?
Trigger the workflows directly on depot via CLI?
We’d need more details around what you’re seeing. It is true that if auth across GitHub is broken then we can’t copy your actions out to be used by Depot CI. However, we have a solution in the works for that as well.
In short, Depot CI, our own engine and control plane, is not dependent on the upstream Actions control plane, but it still has to listen for commit events to know if/when to run jobs on things like PRs. This too is being removed in the future.
We're now considering Buildkite (apparently they have a GH actions migration tool) or self hosting something (GitLab CI, maybe even Jenkins), as it looks like that would've kept ticking over since we're still seeing webhooks being triggered today during the downtime.
https://www.blacksmith.sh/ and https://runs-on.com/
They also say that they're much cheaper than github
Setting it all up would have been tediously annoying eight months ago (Buildkite requires setting up GitHub webhooks for each repo).
Last week I just had codex set up everything, ephemeral vm runners and all, using a couple of low-spec refurb mac minis, Buildkite’s API, a short-lived API token, and migrate my repositories one by one.
So far so good, it’ll pay for itself within two to three months, and following today’s outage I suggested at work that we experiment with the same set up.
They’re considering it.
Jesus, that's both horrible and seems within reach.
The external page linked above goes the other extreme and considers it a bad status whenever any individual service is degraded.
In reality the majority of people only use 3 or 4 of the core services the majority of the time but since there's no "core services" SLA/uptime the usability of github for the majority of people is slightly obfuscated.
GitHub was, once upon a time, quite stable. Things have changed: more features, more usage, and automated agents.
"Microsoft’s GitHub was positioned to win the AI coding race. Outages got in the way" - https://www.cnbc.com/2026/05/22/microsoft-was-positioned-to-...
Something’s wrong when my own infrastructure is more reliable than Microsoft’s.
Even though it's selfhosted and we don't have a dedicated infrastructure team, I don't remember it ever being down in the last 12 years I have been working here.
EDIT: sorry, I meant this rant for the one complaining about the free service, not for the paid customers (for whom this is unacceptable)
Reminds me of the occasional “JavaScript developer tries to vibe debug a Linux kernel issue” comments we get here.
Thanks for pointing out that nobody is using that thing
- GitHub
- Hiring budgets
- RAM (/personal computing in general)
- Electricity
- Media/Content
- Truth
The open source contribution model as we once knew it is dead; you're not going to accept patches from random agents. The risk is way too high. And you can see that increasingly "AI Slop" makes it difficult to be a maintainer of any semblance of a popular repo.
So what's the value? A durable place to store work? hah.
Discovery? That part of Github has always been shitty.
So that leaves.. Github Actions? The thing that is down every other day and has been the subject of a few ~rug pulls~/attempted price hikes that are almost surely coming back?
We have already seen this in the last few weeks, but now this has become a meme that keeps on giving. GitHub down! GitHub up again. GitHub Down! GitHub ... ...
This is a conservative estimate assuming linear growth, the actual number is likely going to be higher. Much higher.
It's not too hard to grow 14X YoY if you start from a hundred customers. If you have hundreds of millions? Yeah, not so easy.
With all the recent negativity – how are they not even TRYING to fix the damn thing?
Or maybe it's before the GitHub internal devs are online and deploying changes.
I've done some hacky shit in CI scripts, but none made me more mad than that one.
Perfect timing that we post https://www.jxd.dev/writing/building-plain just as this latest incident started.
Self hosted Gitlab with self hosted (or AWS) runners running your pipelines.. We only use Github as a mirror for our public repositories.
I am trying to refrain from "off topic" rants... but such microsoft github abuse is generating so much hate due to their dominant market position, it is hard.
This is why we don't use Github Actions, kids.
Seriously, it's a proprietary build service that puts the keys to the kingdom in someone else's control. Just: No!
Print this status page to PDF so you've got it handy next time someone castigates you for not using Github Actions, folks.
Today it was caused by friendly fire: the automatic suspension of the GitHub Actions bot, which is now a "Ghost" user. Since there is no CEO of GitHub to contact, we are just going to see more [1] of this.
You might need to push a critical change soon, but now you cannot. You won't get any of these issues if you self-host, as I said 6 years ago...[2]
[0] https://www.githubstatus.com/incidents/g6ffrm0rfvz9
I'm guessing related to this? The blog post is dated 11 days ago but I just noticed a blue banner on my actions page today.