Railway, their hosting provider, is entirely down as well.
From https://status.railway.com/
>Identified
>Google Cloud has blocked our account, making some Railway services unavailable. We have escalated this directly with Google. The Railway Platform team has since confirmed access to Google Cloud and is working on restoring access to all workloads. We have access to some of our Google Cloud–hosted infrastructure and are working to restore the rest of the service. We apologize for the disruption.
The landing page is up again now but unfortunately will have to default to accepting demo requests for now :(
The tool I'd actually want isn't "tries harder to fix everything." It's one that credibly says "this touches an invariant I can't see — here's what I think might happen, you handle it." Calibrated humility beats confident patches.
Curious how your high-confidence threshold actually works. Self-reported model certainty (notoriously unreliable), test coverage in the affected area, blast-radius of the change, something else?
"Please check your network settings to confirm that your domain has provisioned.
If you are a visitor, please let the owner know you're stuck at the station."
Would love to learn more and consider being a customer!
This is interesting. My prior belief has been that this automates a one-time setup, and perhaps the quarterly clean-up or reactive monitoring changes that people do today. Curious what your experience has been: do teams accept these ongoing maintenance PRs at a good rate?
For full disclosure / context: we work in a related space - investigation agents for production issues.
Telemetry goes to some provider or local hosted solution? And then to your upstream ai provider for analysis?
When you're installing Superlog, you can use any coding agent you'd like, including a local model.
Your telemetry then goes into our data stores, and right now we have one DC on the US west coast.
Whenever there's an error log or trace, Superlog can analyze it and prepare a resolution PR (or a note if something needs to be done manually).
This can be turned off and then the incident can be sent to your own models via a webhook.
We use one of the frontier models for that (it's an upstream AI provider). We're working on our own fine-tuned version of a SoTA model to minimize dependency on other AI providers.
To investigate an incident, we clone the repo in our worker, and pass the repository files to a coding agent in a sandbox. The agent has an MCP that gives it access to the telemetry (logs/metrics/traces) of the project.
The coding agent will then investigate the incident and prepare a patch. It hands over the patch via a tool. The worker then deterministically pushes the patch to a branch and opens the PR.
This way the agent doesn't have full Git access and can't do anything it's not supposed to do in the repository.
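A minimal sketch of that handoff, assuming the agent's tool returns a unified diff as a string. The names `touched_paths` and `check_patch` and the blocked-path list are hypothetical, not our actual worker code; the point is only that the worker, not the agent, decides what reaches Git.

```python
import re
from pathlib import PurePosixPath

# Hypothetical policy: paths the worker refuses to let a patch touch.
BLOCKED_PREFIXES = (".git/", ".github/workflows/")

# Matches the file header lines of a git-style unified diff.
HEADER_RE = re.compile(r"^(?:--- a/|\+\+\+ b/)(.+)$")


def touched_paths(patch: str) -> list[str]:
    """Collect the file paths named in the '--- a/' and '+++ b/' headers."""
    paths = []
    for line in patch.splitlines():
        match = HEADER_RE.match(line)
        if match:
            paths.append(match.group(1).strip())
    # Deduplicate while keeping first-seen order.
    return list(dict.fromkeys(paths))


def check_patch(patch: str) -> list[str]:
    """Return policy violations; an empty list means the patch may be pushed."""
    problems = []
    paths = touched_paths(patch)
    if not paths:
        problems.append("patch touches no files")
    for path in paths:
        if path.startswith("/") or ".." in PurePosixPath(path).parts:
            problems.append(f"path escapes the repository: {path}")
        elif path.startswith(BLOCKED_PREFIXES):
            problems.append(f"path is blocked by policy: {path}")
    return problems
```

Only when `check_patch` returns an empty list would the worker apply the diff, push the branch, and open the PR with its own credentials. A real check would need more than this (for example, content lines that happen to look like headers), so treat it as an illustration of the split in responsibilities.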
There are good ways to link operations between different services with OpenTelemetry (for example, passing the parent trace id in an inter-service HTTP/gRPC request). It's a bit tedious to do by hand, that's why we're publishing the skill that does that for you.
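To make the trace-id passing concrete, here is a rough stdlib-only sketch of the W3C `traceparent` header that OpenTelemetry propagates by default. The helper names `outgoing_headers` and `continue_trace` are made up for illustration; in a real service the OpenTelemetry SDK's propagators do this for you.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_headers(trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Caller side: attach the current trace and span ids to the request."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def continue_trace(headers: dict) -> Optional[dict]:
    """Callee side: reuse the caller's trace id and record its span as parent."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", "").strip())
    if match is None:
        return None  # missing or malformed header: start a fresh trace instead
    trace_id, parent_span_id, _flags = match.groups()
    # All-zero ids are invalid according to the spec.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": secrets.token_hex(8),  # new 16-hex-char id for the callee span
    }
```

The callee keeps the same `trace_id` and points its new span at the caller's span, which is what lets a backend stitch the two services into one trace.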
And totally agreed on config changes and deploy info. We've seen that having good environment and version control (commit hash, file name, line number) tagging is extremely important for root cause analysis, so we go hard on this in the skills.
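As a rough illustration of that tagging, here is a stdlib `logging` sketch that stamps every log line with environment, commit hash, file name, and line number. The env var names `APP_ENV` and `GIT_COMMIT` and the JSON field names are assumptions for this example, not what the skills actually emit (they use the OTel SDKs).

```python
import json
import logging
import os


class DeployContextFormatter(logging.Formatter):
    """Render each record as JSON with deploy and source-location tags."""

    def __init__(self, environment: str, commit: str):
        super().__init__()
        self.environment = environment
        self.commit = commit

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "environment": self.environment,
            "commit": self.commit,
            "file": record.filename,   # filled in by logging from the call site
            "line": record.lineno,
        })


def build_logger(name, stream, environment=None, commit=None) -> logging.Logger:
    # Hypothetical env var names; use whatever your deploy pipeline exports.
    environment = environment or os.environ.get("APP_ENV", "unknown")
    commit = commit or os.environ.get("GIT_COMMIT", "unknown")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(DeployContextFormatter(environment, commit))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, an error log can be tied to the exact commit and source line that produced it, which is most of what root cause analysis needs to narrow down a regression.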
We also have many infra integrations in our roadmap to make sure that we can deeply analyze the infra/config side of things.
I kid, nice work. As others have said, investigation and understanding "why it was originally done that way", not the patch, is usually the lion's share of the work.
That being said, for more complex setups like Kubernetes, where you need a collector and an operator, I found OTEL to be super painful to set up a couple of years ago. Has it gotten any easier now?
I'm afraid a collector and the operator are still the recommended way to go by OpenTelemetry (https://opentelemetry.io/docs/platforms/kubernetes/getting-s...). We're still working on a custom skill for Kubernetes, but the general skill should give you a sane default already.
A good way to start is to instrument the service and send traces/logs directly, with our backend as the collector.
I also help out personally whenever our clients have any questions on setting up the telemetry :)
It's something we've thought a lot about at Amplitude. We'd love to talk.
Right now, the prompt will enumerate all the services and install the OpenTelemetry SDK (https://opentelemetry.io/) in each service.
Then for every service, the skill will make sure that:
- Every time something breaks and an operator needs to take a look, there's an error log.
- All important steps in a process emit info/debug logs (so that an issue can be investigated).
- Operations are covered with spans with relevant attributes.
- Cost (LLM tokens), API performance (latency/RED), tenant activity (cost/usage per tenant) are covered by metrics so that you can use Superlog MCP to build cool dashboards.
For most common stacks like NextJS, FastAPI, React Native/Expo etc. we have a custom skill that explains the best practices for this specific technology. For all the other stacks we ask the agent to use general best practices.
We have evals for all custom skills where we start from a starter project, run the agent with the skill and use LLM-as-a-judge to compare it to a human-written 'golden patch'.
In general, we try to:
- minimize diff, so that the instrumentation is easy to review
- make small chunks of additive diffs vs huge indents / moving logic around
- minimize new dependencies
- use well-supported and audited OTel SDKs vs custom libs
You can read the skills here: https://github.com/superloglabs/skills.
I'll make sure to add this to our landing and print this out as the agent writes the code!
Thank you for the feedback!
I made the Slack onboarding step mandatory for now since we thought that a lot of our value was in sending investigations and PRs, and Slack was what we used ourselves.
What tool do you use for communication around your project? If you don't want to share publicly, could you please shoot a line to:
ash [at] superlog.sh?
Would love to learn about your use case in more detail too!
For my current project I would use webhooks/email just like I do currently for my monitoring and alerting.
The platform itself doesn't need Slack to function, we just observed that users got more value if they could get notifications somehow, so I'm more than happy to add more comms platforms :)
> Start with one repo. Price the rest when the signal is real.
which makes it sound like possibly the $150/mo price is per-repo?
I think that could use some clarification - if I have 10 services in a monorepo vs 10 individual service repos, does that 10x my cost?
The pricing is only by usage (traces/logs/metrics) and investigation credits. We don't charge extra for repos :)
Could you please send me an email at ash [at] superlog.sh? I'd love to hear more about your use case - we might have something for you very soon!
1. Counter-positioning. Most existing tools have invested heavily in their web platforms and compete on their UI/UX. But actually, what matters to our clients is that bugs are fixed. Our top clients would rather never open our tool at all. If our competitors want to beat us, they essentially have to fight against their established business models that hinge on users looking at their browsers.
2. Evals. To get the most accurate RCA you need a very good suite of evals: what was the right root cause of this bug? What is the right fix? We're investing in this heavily, and as one of the early movers we have a big advantage here.
At the same time, I tend to approach strategy with a lot of caution. A lot of the canonical reasoning behind 'startup positioning' is based on extrapolation from trends, but surprisingly few analogies work in economics.
Our focus right now is:
- talking to our users
- making sure they have the best experience
We have an eval suite with code+telemetry fixtures and a golden RCA+patches and an LLM-as-a-Judge. So whenever we get feedback from our users and they're OK with it, we use their feedback to create an eval case (it's still quite manual since you have to calibrate the case).
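Roughly, the shape of such a suite looks like the sketch below. All names here are hypothetical, and `similarity_judge` is a plain text-similarity stand-in so the example runs without a model; it is not a real LLM-as-a-Judge and not our actual grader.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class EvalCase:
    name: str
    golden_root_cause: str
    golden_patch: str


@dataclass
class AgentOutput:
    root_cause: str
    patch: str


# A judge maps (candidate, golden) to a score in [0, 1].
Judge = Callable[[str, str], float]


def similarity_judge(candidate: str, golden: str) -> float:
    """Placeholder for the LLM judge: character-level similarity only."""
    return SequenceMatcher(None, candidate, golden).ratio()


def run_suite(
    cases: list[EvalCase],
    outputs: dict[str, AgentOutput],
    judge: Judge = similarity_judge,
    threshold: float = 0.8,
) -> dict[str, bool]:
    """A case passes only if both the RCA and the patch clear the threshold."""
    results = {}
    for case in cases:
        out = outputs.get(case.name)
        if out is None:
            results[case.name] = False  # agent produced nothing for this case
            continue
        rca_score = judge(out.root_cause, case.golden_root_cause)
        patch_score = judge(out.patch, case.golden_patch)
        results[case.name] = min(rca_score, patch_score) >= threshold
    return results
```

Scoring the root cause and the patch separately matters: a patch that happens to match while the stated root cause is wrong should still count as a failure.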
We use Superlog to observe Superlog, so I often extract cases from our own errors. The PRs get better and better, but, of course, it's sort of a continuous improvement process.
The moment something changes the system, it no longer observes it; in fact, observing something might cause it to change ( https://en.wikipedia.org/wiki/Observer_effect_(physics) )
Either it's a tool for observing or it's a tool for fixing issues, it cannot be both, by physical principle.
Best case scenario here is that the product succeeds, and then you need to instrument the product itself in order to observe it, like debugging the debugger. But it wouldn't be an observability tool, it would shift the product that needs to be observed from the previous source code that is now a target language into the new source code that is now your product.
I agree with the philosophical principle! If you give a rigid observer an incentive to 'remove bugs', it will happily silence all alerts and report success.
Our goal is to make sure that doesn't happen. The investigation agent is actually a separate agent with a separate goal.
In practice, we rarely see the agent just silencing stuff. When this happens, I get on it and make it an eval case :)
I guess the change in voltages, arrangement of registers, filling of buffers in the network stack are changing but... what?
Anecdotally, our top clients accept 80-90% of PRs, with several clients accepting all of them and requesting an auto-merge feature. I myself accept most of Superlog's PRs to Superlog. PRs that stay unmerged are usually due to a client losing interest in our product / abandoning the instrumented project.
Another interesting point is that not every defect is a PR. Often it's a misconfiguration in an external service, so there's a special incident state for that. For example, yesterday I forgot to verify our domain on Resend, so some verification emails didn't go through. Superlog pinged me on Slack and explained where to go to fix it.
Super glad you like the npx onboarding and the MCP tool :) Please keep the feedback coming!