We just use Prometheus + Grafana now. Yes, this technically also slows the app down, but OTel was unbearably slow.
I'm sure I'm doing a million and one things wrong, but I can't be arsed to set something up just to see some performance metrics. Deadlocks can be found using transaction metrics; that's all you need.
Edit: I now read in the comments that the JS version is a bad implementation; I guess that might be part of the reason.
I haven't tried, but it's probably possible to do the same with JS.
I think Sentry was also expanding into tracing--might be worth a look to see if they're doing something that works better in their library
That said, if your goal is basic performance metrics and nothing more, then tracing is overkill. You don't even need an SDK; just monitor the compute nodes asynchronously with some basic system metrics. But if your goal is to narrow down behaviors within your app on a per-request basis, there really is no way around tracing if you value your sanity.
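For the "basic system metrics, no SDK" route, in practice you'd probably just run node_exporter, but even the hand-rolled version is only a few lines. A minimal sketch, assuming the third-party psutil and prometheus_client packages; metric names and the 15s interval are illustrative:

```python
# Expose basic system metrics for Prometheus to scrape, no tracing SDK involved.
import time

import psutil
from prometheus_client import Gauge, start_http_server

cpu = Gauge("node_cpu_percent", "CPU utilisation percent")
mem = Gauge("node_memory_percent", "Memory utilisation percent")
disk = Gauge("node_disk_percent", "Disk utilisation percent for /")

start_http_server(9100)  # serves /metrics as a scrape target

while True:
    cpu.set(psutil.cpu_percent(interval=None))
    mem.set(psutil.virtual_memory().percent)
    disk.set(psutil.disk_usage("/").percent)
    time.sleep(15)
```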
Sadly, there was always an alternative that no one took: dtrace. Add USDTs to your code and then monitor progress by instrumenting it externally, sending the resulting traces to wherever you want. My sincere hope is that the renewed interest in ebpf makes this a reality soon: I never want to have to do another from opentelemetry.<whatever> import <whatever> ever again.
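The consumer side of that is already possible with eBPF today. A rough sketch with the BCC Python bindings (assumes root, the bcc package, and a target process that actually exposes USDT probes, e.g. a CPython built with --with-dtrace, which publishes function__entry; the probe name is just an example):

```python
# Attach to a USDT probe in an already-running process, entirely from outside
# the app: no imports in the application, no redeploy, detach by stopping this.
import sys

from bcc import BPF, USDT

pid = int(sys.argv[1])  # pid of the instrumented process

prog = r"""
#include <uapi/linux/ptrace.h>
int on_entry(struct pt_regs *ctx) {
    bpf_trace_printk("USDT probe fired\n");
    return 0;
}
"""

usdt = USDT(pid=pid)
usdt.enable_probe(probe="function__entry", fn_name="on_entry")

b = BPF(text=prog, usdt_contexts=[usdt])
b.trace_print()  # stream the trace pipe until Ctrl-C
```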
Honeycomb is decent at what it does, but its dashboarding offerings are very limited.
Coming from Datadog, Grafana is such a bad experience I want to cry every time I try to build out a service dashboard. There's so much more friction to get anything done, like adding transform functions/operators or doing smoothing or extrapolation; even time shifting is like pulling teeth. Plus they just totally broke all our graphs with formulas for like 2 days.
Grafana is to Datadog what Bugzilla is to Linear.
They have a SQL-like query language that I think can do most of what you're describing.
Heck, I'm dying for "I can copy a graph between dashboards". Grafana allows this, but if any variable used in the graph doesn't exist in the destination dashboard, pasting just creates an empty graph.
Also, I set up alerts for >50% CPU or >90% disk full. I do get an alert, but it doesn't say which volume it is, how full it is, or what the actual value was that triggered the alert. WTF.
I enjoy having everything instrumented and in one spot, it's super powerful, but I am currently advocating for self-hosting Loki so that we can have debug+ level logs across all environments for a much, much lower cost. Datadog is really good at identifying anomalies, but the cost for logs is so high there's a non-trivial amount of savings in sampling and minimizing logging. I HATE that we have told devs "don't log so much" -- that misses the entire point of building out a haystack. And sampling logs at 1% and only logging warnings+ in prod makes it even harder to identify anomalies in lower environments before a prod release.
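For what it's worth, the 1% sampling part is trivial to do app-side. A minimal sketch with Python's standard logging module (the rate and handler setup are illustrative, not how Datadog implements it):

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING+ record, but only ~1% of DEBUG/INFO noise."""

    def __init__(self, rate: float = 0.01):
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.rate


handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.01))
logging.basicConfig(level=logging.DEBUG, handlers=[handler])
```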
Last hot take: the UX in Kibana in 2016 was better than anything else we have now for rapidly searching through a big haystack and identifying and correlating issues in logs.
[0] https://www.ibm.com/products/instana/opentelemetry
[1] https://github.com/instana/instana-otel-collector
[2] https://play-with.instana.io/#/home
[disclaimer: I'm an IBMer]
IIUC, Grafana connects directly to Honeycomb via its API to visualize data without storing it. Instana, on the other hand, is a bit different: it needs telemetry data to be ingested into its backend before it can be visualized in the UI. With Honeycomb, this could work if the data can be exported from Honeycomb to Instana.
I guess it depends on what you're used to.
Compared to dumping logs to a file (or a single-instance Prometheus scraping /metrics), everything is frustrating because there are so many moving parts anyway: you want to query stuff and correlate, but for that you need to propagate the trace id, and emit and store spans, and take care to properly handle async workers; and you want historical comparisons, and downsampling for retention, and of course auto-discovery/registration/labeling from k8s pods (or containers or whatever), and source code upload and release tagging from CI, and ...
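The "propagate the trace id and handle async workers" part is where much of that pain lives. With the OpenTelemetry Python API it looks roughly like this; enqueue()/worker() and the carrier-in-the-payload shape are my own illustration, not a prescribed pattern, and it assumes a TracerProvider is already configured:

```python
from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)


def enqueue(job: dict) -> dict:
    with tracer.start_as_current_span("enqueue-job"):
        carrier: dict = {}
        propagate.inject(carrier)  # writes the W3C traceparent into the dict
        return {"payload": job, "otel": carrier}


def worker(message: dict) -> None:
    ctx = propagate.extract(message["otel"])  # restore the upstream context
    with tracer.start_as_current_span("process-job", context=ctx):
        ...  # spans emitted here join the original trace
```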
Half the things you list aren't even part of the SDKs; they're part of the Collector.
I can run a full tracing stack locally for dev use with minimal config.
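App-side, pointing the Python SDK at that local stack is also only a few lines. A sketch assuming something is listening for OTLP/gRPC on localhost:4317 (a collector, Jaeger, etc.) and the opentelemetry-sdk and opentelemetry-exporter-otlp packages:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to the local endpoint over plaintext gRPC.
provider = TracerProvider(resource=Resource.create({"service.name": "dev-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("hello"):
    pass  # should show up in the local UI almost immediately
```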
The core issue is that, with otel, observability platforms become just a UI layer over a database. No one wants to invest in proper instrumentation, which is a difficult problem, so we end up with a tragedy of the commons where the instrumentation layer itself gets neglected as there is no money to be made there.
I don't think it's fair to say "no one wants to invest in proper instrumentation" - the OpenTelemetry community has built a massive amount of instrumentation in a relatively short period of time. Yes, OpenTelemetry is still young and unstable, but it's getting better every day.
As the article notes, the OpenTelemetry Collector has plugins that can convert nearly any telemetry format to OTLP and back. Many of the plugins are "official" and maintained by employees of Splunk, Datadog, Snowflake, etc. Not only does it break the lock-in, but it allows you to reuse all the great instrumentation that's been built up over the years.
> The core issue is that, with otel, observability platforms become just a UI layer over a database.
I think this is a good thing - when everyone is on the same playing field (I can use Datadog instrumentation, convert it to OTel, then export it to Grafana Cloud/Prometheus), vendors will have to compete on performance and UX instead of their ability to lock us in with "golden handcuffs" instrumentation libraries.
These are issues you'd experience with anything that spans your stack; a custom telemetry library would have them too.
OTel was made basically to track request execution (and anything that request triggers) across multiple apps at once, not to instrument a single app to find slow points.
It's a great idea, in principle, but unless it gets strong backing from big tech, I think it'll fail. I'd love to be proven wrong.
But all major vendors _do_ contribute to OTEL.
The license is the key enabler for all of this. The vendors can't be all that sneaky in the code they contribute without much higher risk of being caught. Sure, they will focus on the funnel that brings more data to them, but that leaves others more time to work on the other parts.
Making sense out of so much data is why Datadog and Sentry make so much money.
Of course you could also roll your own telemetry, which is generally not that difficult in a lot of frameworks. You don't always need something like OTEL.
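"Roll your own" can be as small as a timing decorator that emits a structured log line per operation. A minimal sketch; the names are illustrative and not from any particular framework:

```python
import functools
import logging
import time

log = logging.getLogger("telemetry")


def timed(name: str):
    """Record how long the wrapped function takes and log it as key=value."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.info("op=%s duration_ms=%.2f", name, elapsed_ms)

        return wrapper

    return decorator


@timed("load_user")
def load_user(user_id: int):
    ...
```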
It needs to be treated as an integral part of whatever framework is being instrumented. And maintained by those same people.
Am I wrong?
EDIT: we actually have two. The author plans to open source the one we use for Node eventually. It's a drop-in replacement for the Span and Trace classes and Just Works with upstream OTel. The main blocker is that we have some patch-package fixes for other performance issues with upstream, and we need to make our stuff work with the non-patched upstream.
The one we use for Workers is more janky and doesn’t make sense to open source. It’s like 100 total LoC but doesn’t have compatibility with existing Otel libraries.
It will always be overkill for just an app or two talking with each other... till you grow, and then it won't be overkill any more.
But it still might be worth getting into on smaller apps, just thanks to the wealth of tools available.
eBPF solves this by reversing the model: instrument the system, not the application. Turn it on / off dynamically, zero redeploys, minimal overhead.
The missing piece is accessibility. Kernel-level observability exists; "normal engineers can use it" and good DX do not.
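For context, "kernel-level observability exists" currently looks roughly like this. A sketch with the BCC Python bindings (needs root and the bcc package; the probe target is just an example): attaching a kprobe to tcp_v4_connect shows outbound connects from every process on the host with no code changes or redeploys, and it also illustrates why the DX still isn't there for most app teams.

```python
from bcc import BPF

# Trace outbound IPv4 connects system-wide; stop the script to detach.
prog = r"""
#include <uapi/linux/ptrace.h>
int trace_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits = tgid (pid)
    bpf_trace_printk("tcp_v4_connect from pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")
b.trace_print()  # stream the kernel trace pipe until Ctrl-C
```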
I am running a few projects on a minimal Hetzner K3S cluster and just want some cheap, easy observability: store logs, reduce log noise, and rely on counters/metrics instead, without paying an arm and a leg.
Languages used are mostly Rust, JavaScript and Python.
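On the "counters/metrics instead of log noise" point, the Python side of that is just a labelled counter that Prometheus scrapes, rather than a log line per event. A sketch assuming the prometheus_client package; the metric and label names are made up:

```python
from prometheus_client import Counter, start_http_server

payment_failures = Counter(
    "payment_failures_total", "Failed payment attempts", ["reason"]
)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape


def handle_payment_error(reason: str) -> None:
    # Count the event instead of emitting a log line per occurrence.
    payment_failures.labels(reason=reason).inc()
```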
The collector is a Helm chart; someone on my project added it to our K8s clusters last week. It was like 30 lines of YAML/Terraform in total: logs, trace forwarding, Prometheus scraping. That bit is easy.
Idk about deploying the UI/storage. I've used the Grafana Loki stack in Docker Compose for local development without much head-scratching.
Native k8s logs + Prometheus is probably the lighter-weight option, but you don't get traces. You could find some middle ground by using the OTel Collector to extract trace metrics, so you'd get RED metrics but not full traces.