Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.
But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.
Starting with a minimal working example is the best way I have found to fix a bug. You will find you resist this because you believe you are close and it seems too time-consuming to start from scratch. In practice, it tends to be a real time-saver, not the opposite.
So as a metaphor for software debugging, this is "throw away the code, buy a working solution from somewhere else." It may be a way to run a business, but it does not explain how to debug software.
Those cases are highly seductive: the code seems 98% of the way there, and you think it's only another 2% to get it working, but you never get to the end of that 2%. It's like pushing a bubble around under a rug. (Many of today's LLM enthusiasts, after they "pay their dues", will wind up sounding like me.)
This is documented in https://www.amazon.com/Friends-High-Places-W-Livingston/dp/0... and takes its worst form when you've got an improperly designed database that has been in production for some time and cannot be 100% correctly migrated to a correct database structure.
Take the Wikipedia article on debugging[1]. The first sentence identifies what debugging is: "In engineering, debugging is the process of finding the root cause, workarounds and possible fixes for bugs." I'd say this is implicitly about taking broken code and finding the error in it.
The second paragraph is clearer on this point:
> For software, debugging tactics can involve interactive debugging, control flow analysis, log file analysis, monitoring at the application or system level, memory dumps, and profiling. Many programming languages and software development tools also offer programs to aid in debugging, known as debuggers.
All of the debugging tactics listed are about working with an existing broken piece of code.
Further sections on debugging tools, debugging process ("debugging process normally begins with identifying the steps to reproduce the problem"), and techniques are also about working with broken code.
Completely rewriting the code is certainly a way to resume progress on a software project, but as the practice of debugging is imagined, rewriting is not debugging.
Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who can trust you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.
This also works well in conjunction with debug tooling -- the tooling gives you the raw information, writing down that information helps join the dots.
Another developer and I were tasked with fixing it. The Customer Service manager (although one of the most conniving, politically destructive assholes I have ever not-quite worked with) actually carried a crap umbrella. Instead of constantly flaming us with how many millions of dollars our outage was costing every minute, he held up that umbrella and diverted the crap. His forbearance let us focus. He discreetly approached every 20 minutes, toes not quite entering the office, calmly inquiring how it was going. In just over an hour (between his visits 3 and 4), Nate and I had the diagnosis and the fix, and had rolled it out to production, to the relief of pension funds worldwide.
As much as I dislike the memory of that manager to this day, I praise his wisdom every chance I get.
- "her job was to be a crap-umbrella": hyphenate into a compound noun, implies "an umbrella of/for crap" to clarify the intended meaning
- "her job was to be a crappy umbrella": make the adjective explicit if the intention was instead to describe an umbrella that doesn't work well
A better analogy is you’re there to turn down the noise. The team hears what they need to hear and no more.
Equally, the job of a good manager is to help escalate team concerns. But just as there's a filter stopping the shit flowing down, you have to know what to let flow up, too.
Not all issues can be fixed with a rollback though.
Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.
“Slow is smooth, smooth is fast.”
I know this and still get caught out sometimes…
Edit: hahaha ‘Cerium saying ~same thing: https://news.ycombinator.com/item?id=42683671
Here's a walkthrough on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
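For anyone who hasn't tried it, the whole loop is only a handful of commands. A rough sketch (the good tag and the test command are placeholders for whatever fits your repo):

    git bisect start
    git bisect bad                 # the checkout you're on is broken
    git bisect good v1.4.0         # placeholder: any commit/tag known to work
    # git now checks out a midpoint; test it, then mark it:
    git bisect good                # or: git bisect bad
    # ...repeat until git names the first bad commit...
    # or automate the search with any command that exits 0 on good, non-zero on bad:
    git bisect run make test
    git bisect reset               # go back to where you started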
The thing is, I don't even bisect that often. The discipline necessary to maintain bisectability in your source history heavily overlaps with the disciplines that prevent regressions and bugs in the first place. But when I do finally use it, it can pay for itself in literally one shot once a year, because we bring bisect out for the biggest, most mysterious bugs -- the ones that I know from experience can have devs staring at code for weeks. And while I've yet to have a bisect point at a one-line commit, I've definitely had it hand us multiple days' worth of clues in one shot.
If I were maintaining that discipline just for bisect, we might quibble over the cost/benefit, but since there are a lot of other reasons to maintain that discipline anyhow, it's a big win.
Edit to add: I will study old diffs when there is a bug, particularly for bugs that seem correlated with a new release. Asking "what has changed since this used to work?" often leads to an obvious cause or at least helps narrow where to look. Also asking the person who made those changes for help looking at the bug can be useful, as the code may be more fresh in their mind than in yours.
You’re spot on.
However, it's clearly a missing feature that Git/Mercurial can't tag diffs as "passes" or "bisectable".
This is especially annoying when you want to merge a stack of commits and the top passes all tests but the middle does not. It’s a monumental and valueless waste of time to fix the middle of the stack. But it’s required if you want to maintain bisectability.
It’s very annoying and wasteful. :(
I have only been in one place where "rebase" is used regularly, and now that I'm a little more familiar with it, I don't mind using it to bring changes from a parent branch into a working branch, as long as the working branch hasn't been pushed to origin. It still weirds me out somewhat, and I don't see why a simple merge can't just be the preferred way.
I have, however, seen "squashing" regularly (and my current position uses it as well as rebasing) -- and I don't particularly like it, because sometimes I put in notes and trials that get "lost" as the task progresses, but nonetheless might be helpful for future work. While it's often standard to delete "squashed" branches, I cannot help but think that, for history-minded folks like me, a good compromise would be to "squash and keep" -- so that the individual commits don't pollute the parent branch, while the details are kept around for anyone needing to review them.
Having said that, I've never been in a position where I felt like I need to "forcibly" push for my preferences. I just figure I might as well just "go with the flow", even if a tiny bit of me dies every time I squash or rebase something, or delete a branch upon merging!
But not linked together and those "closed" branches are mixed in with the current ones.
Instead, try out "git merge --no-ff" to merge back into master (forcing a merge commit to exist even if a fast-forward was possible) and "git log --first-parent" to only look at those merge commits. Kinda squash-like, but with all the commits still there.
I think every commit that gets merged to main should be an atomic believed-to-work thing. Not only does this make bisect way more effective, but it's a much better narrative for others to read. You should write code to be as readable by others as possible, and your git history likewise.
git bisect should operate on bisectable commits. Which may not be all commits. Git is missing information. This is, imho, a flaw in Git.
This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!
Someone who thinks about a problem via "which tool do I want" (c.f. "git bisect helps a lot"[1]) is going to be at a huge disadvantage to someone else coming at the same decisions via "didn't this used to work?"[2]
The world is filled to the brim with tools. Trying to file away all the tools in your head just leads to madness. Embrace philosophy first.
[1] Also things like "use a time travel debugger", "enable logging", etc...
[2] e.g. "This state is illegal, where did it go wrong?", "What command are we trying to process here?"
This is one area where I've been disappointed by Rust. They cleaned up testing and dependency management by bringing them into the core tooling, but debugging is still a mess of several poorly supported cargo extensions, none of which seem to work consistently for me (no insult to their authors, who are providing something better than nothing!).
It's very rare for me to use debuggers.
Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.
That's why our runbook recommends starting troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops the traffic: to VPN, through VPN, to destination, back to VPN node, back through VPN, back to source. Then you realize traffic back to the VPN node disappears, and you can dig into that.
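Something like the following, run on the VPN node, usually answers "which leg eats the traffic" in a minute or two (interface names and addresses are placeholders for your environment):

    tcpdump -ni eth0 host 10.1.2.3 and port 443   # does the traffic arrive from the source at all?
    tcpdump -ni tun0 host 10.1.2.3 and port 443   # does it make it into/through the tunnel?
    # Run the same captures on the far-side VPN node to cover the return path.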
And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?
As another example, for many services the answer here is to look at the requests on the load balancer. This quickly isolates which services are throwing errors and blowing up requests, so you can start looking at those. Or system metrics can help: which services / servers are burning CPU, and thus doing something, and which aren't? Does that pattern make sense? Sometimes this can tell you which step in a pipeline of steps on different systems fails.
When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.
Rinse and repeat, it's a basic binary search.
I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, in code or in debugging, which always reminds me what a useless metric it is for hiring engineers.
I find myself doing this all the time now: I will temporarily add a line that causes a fatal error, to check that it's the right file (and, depending on the situation, also the right line).
Once I'm satisfied with the test, I remove the line.
Usually, in the process of reducing my error-generating code down to a simpler case, I find the bug in my logic. I've been fortunate that heisenbugs have been rare.
Once or twice, I have ended up with something to report to the devs. Generally, those were libraries (probably from sourceforge/github) with only a few hundred or fewer users that did not get a lot of testing.
Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
To the extent that other concerns get in the way of the concept, like the general difficulty of testing that GUIs do what they are supposed to do, I don't blame the concept of unit testing; I blame the techs that make the testing hard.
If anything, I'd only keep those if it's hard to write them, if people push back against it (and I myself don't like them sometimes, e.g. when the goal is just to push up the coverage metric without actually testing much, which only adds test code to maintain but no real testing value...).
Like any topic there's no universal truth and lots of ways to waste time and effort, but this specifically is extremely practical and useful in a very explicit manner: just fix it once and catch it the next time before production. Massively reduce the chance one thing has to be fixed twice or more.
Some examples that come to mind are bugs to do with UI interactions, visuals/styling, external online APIs/services, gaming/simulation stuff, and asynchronous/thread code, where it can be a substantial effort to write tests for, vs fixing the bug that might just be a typo. This is really different compared to if you're testing some pure functions that only need a few inputs.
It depends on what domain you're working in, but I find people very rarely mention how much work certain kinds of test can be to write, especially if there aren't similar tests written already and you have to do a ton of setup like mocking, factories, and UI automation.
But all things considered, I think my impression still holds; maybe I should rather say they're easier to write, in a way, though not necessarily easy.
Adding tests is not easy, and you can always find excuses to not do that and instead "promise" to do it later, which almost never happens. I have seen enough to know this. Which is why I myself have put more work in writing unit tests, refactoring code and creating test mocks than anyone else in my team.
And I can't tell you how much I appreciate it when I find that this has benefitted me personally -- when I need to write a new test, often I find that I can reuse 80% if not more of the test setup and focus on the test body.
After adding the first test, it becomes much easier to add the second and third test. If you don't add the first test or put the effort into making your code actually testable, it's never going to be easy to test anything.
It's not about being lazy or making excuses, it's not free to write exhaustive tests and the resources have to come from somewhere. For MVPs, getting the first release out is going to be much more important for example.
And I can tell you I have first hand experience of an "MVP" product getting delayed, multiple times, because management realizes that nobody wants to purchase our product when they discover how buggy and unusable it is.
It's very possible to write great software without any tests at all too. It's like people forget that coders used to write assembly code without source control, CI, mocking and all that other stuff.
I've lost count of how many things I've fixed only to see:
1) It recurs because a deeper "cause" of the bug reactivated it.
2) Nobody knew I fixed something so everyone continued to operate workarounds as if the bug was still there.
I realise these are related and arguably already fall under "You didn't fix it". That said, a bit of writing up and root-cause analysis after getting to "It's fixed!" seems helpful to others.
Saying "oh its been good for awhile now" has nothing to do with breaking it in the future.
A lot of slow tests are slow because nobody has even tried to speed them up. They just wrote something that worked, probably literally years ago, that does something horrible like fully build a docker container and fully initialize a complicated database and fully do an install of the system and starts processes for everything and liberally uses "sleep"-based concurrency control and so on and so forth, which was fine when you were doing that 5 times but becomes a problem when you're trying to run it hundreds of times, and that's a problem, because we really ought to be running it hundreds of thousands or millions of times.
I would love to work on a project where we had so many well-optimized automated tests that despite their speed they were still a problem for building. I'm sure there's a few out there, but I doubt it's many.
For this to work all the regression tests must be fast, and 100% reliable. It's worth it though. If the mistake was made once, unless there's a regression test to catch it, it'll be made again at some point.
It doesn't matter how fast it is: if you're continually adding tests for every single line of code introduced, eventually it will get so slow that you'll want to prune away old tests.
The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
Things just sort of happen with wild leaps of logic. The book is actually a fantasy book with the thinnest layer of science babble on top.
"What you have to do, if you get caught in this gumption trap of value rigidity, is slow down...you're going to have to slow down anyway whether you want to or not...but slow down deliberately and go over ground that you've been over before to see if the things you thought were important were really important and to -- well -- just stare at the machine. There's nothing wrong with that. Just live with it for a while. Watch it the way you watch a line when fishing and before long, as sure as you live, you'll get a little nibble, a little fact asking in a timid, humble way if you're interested in it. That's the way the world keeps on happening. Be interested in it."
Words to live by
Maybe I'm misunderstanding, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library's details, and now that a month or two has passed, go look at the bug.
I'd be curious if there's a single programmer that follows this advice.
> read everything in depth
Is not necessarily
> first read the entire manual of the library I'm using, all 700 pages
If I have a problem with "git bisect", I can either go to Stack Overflow, try several snippets, and see what sticks, or I can go to https://git-scm.com/docs/git-bisect to get a bit deeper knowledge on the topic.
The alternative is paradropping into an unknown system with a weird bug, messing randomly with things you don't understand until the tests turn green, and then submitting a PR and hoping you didn't just make everything even worse. It's never really knowing whether your system actually works or not. While I understand that is sometimes how it goes, doing that regularly is my nightmare.
P.S. if the manual of a library you're using is 700 pages, you're probably using the wrong library.
Statement bordering on papyrophobia. (Seems that is a real phobia)
- Scope of the library is massive
- A very peculiar way of writing, with a lot of impressively unnecessary description of minute detail, such that the reader starts counting not only sheep but also breaths before reaching the end of a sentence. (I.e., unnecessarily verbose.)
- Very extensive docs describing things from various angles, including references, topic-based how-tos, and such.
(I agree that the last one is the least likely, but there is always hope)
I would be much more interested in rules that don't start with that... Like "Rules for debugging when you don't have the capacity to fully understand every part of the system."
Bisecting is a great example here. If you are bisecting, by definition you don't fully understand the system (or you would know which change caused the problem!).
Julia Evans also has a very nice zine on debugging: https://wizardzines.com/zines/debugging-guide/
> Assumption is the mother of all screwups.
1. Is this mistake somewhere else also?
2. What next bug is hidden behind this one?
3. What should I do to prevent bugs like this?
The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...
The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.
A few times I looked for a bug like "something is not happening when it should" or "this is not the expected result", when the issue was with some config file, database records, or something sent by a server.
For instance, particularly nasty are non-printable characters in text files that you don't see when you open the file.
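A few quick ways to make those visible (GNU tools; the filename is a placeholder):

    cat -A config.ini                      # show tabs as ^I, line ends as $, other control chars as ^X / M-X
    grep -nP '[^\x20-\x7E\t]' config.ini   # list lines with control bytes or non-ASCII (may also flag legitimate UTF-8)
    xxd config.ini | less                  # when in doubt, look at the raw bytes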
"simulate the failure" is sometimes useful, actually. Ask yourself "how would I implement this behavior", maybe even do it.
Also: never reason from the absence of a specific log line. The logs can be wrong (bugged) too, sometimes. If you're printf-debugging a problem around a conditional, for instance, log both branches.
Rule 1 should be: reproduce with the most minimal setup.
99% you’ll already have found the bug.
The 1% for me was a font that couldn't handle a certain combination of letters in a row: "ft" just didn't work, and that's why it made mistakes in the PDF.
No way I could ever have known that if I hadn't reproduced it down to the letter.
Just split code in half till you find what’s the exact part that goes wrong.
I'd argue that you can't effectively split something in half unless you first understand the system.
The book itself really is wonderful - the author is quite approachable and anything but dogmatic.
I just spent a whole day trying to figure out what was going on with a radio. Turns out I had tx/rx swapped. When I went to check tx/rx alignment, I misread the documentation in the same way as the first time. So I would even add "try switching things anyway" to the list. If you have solid (but wrong) reasoning for why you did something, then you won't see the error later even if it's right in front of you.
It's useful to get the poster and make sure everyone knows the rules.
10) Enable frame pointers [1].
[1] The return of the frame pointers:
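Concretely, that's a compiler switch rather than a code change. A sketch for gcc/clang and rustc (the file and binary names are placeholders):

    gcc -O2 -g -fno-omit-frame-pointer -o myprog main.c           # keep the frame pointer even when optimizing
    RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
    # Profilers and stack-walkers (perf, etc.) can then unwind stacks reliably.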
1. set up the dev environment
2. fork/clone the code
3. create a new branch to make changes and tests
4. use the list to try to find the root cause
5. create a pull request if you think you have fixed the bug
And use Rule 0 from GuB-42: Don't panic
(edited for line breaks)
"Dave was asked as the author of Debugging to create a list of 5 books he would recommend to fans, and came up with this.
https://shepherd.com/best-books/to-give-engineers-new-perspe..."
Unfortunately, I have found many times that this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.
I know it is the best route; I do know the system (maybe I wrote it), and yet time and again I don't take the time to read what I should… I make assumptions in hopes of speeding up the process/fix, and I cost myself time…
I wish this were true, and maybe it was in 2004, but when you've got noise coming in from the cloud provider and noise coming in from all of your vendors I think it's actually quite likely that you'll see a failure once and never again.
I know I've fixed things for people without asking if they ever noticed it was broken, and I'm sure people are doing that to me also.
From my experience, this is the single most important part of the process. Once you keep in mind that nothing paranormal ever happens in systems and everything has an explanation, it is your job to find the reason for things, not guess them.
I tell my team: just put your brain aside and start following the flow of events, checking the data, and eventually you will find where things mismatch.
I don't miss working there.
Don't be too embarrassed to scatter debug log messages in the code. It helps.
My second rule:
Don't forget to remove them when you're done.
The nicest log package I had would always count the number of times a log message was hit, even if the debug level meant nothing happened. The C preprocessor made this easy; I haven't been able to find a short way to do this counting efficiently in other languages.
For intermediate representations this is better than printf to stdout
Thank you for introducing me to this book.
One of my favorite rules of debugging is to read the code in plain language. If the words don't make sense somewhere, you have found the problem or part of it.
I have found that 90% of network problems are bad cables.
That's not an exaggeration. Most IT folks I know, throw out ethernet cables immediately. They don't bother testing them. They just toss 'em in the trash, and break a new one out of the package.
Each comment: "..and this is my 10th rule: <insert witty rule>"
Total number of rules when reaching the end of the post: 9 + n + n * m, with n being number of users commenting, m being the number of users not posting but still mentally commenting on the other users' comments.
Fortunately, my efforts with Dave weren't for naught: as part of testing our own ideas on the subject, I gave a series of presentations from ~2015 to ~2017 that described our thinking. A talk that pulls many of these together is my GOTO Chicago talk in 2017, on debugging production systems.[0] That talk doesn't incorporate all of our thinking, but I think it gets to a lot of it -- and I do think it stands at a bit of contrast to Wheeler's work.
On an unrelated note, one of the folks down there explained the DNS setup once and it was like something out of a Stephen King novel. They'd even been told by a recognized industry expert (whose name I sadly can't remember any more) that what they needed to do was impossible, but they still did it. Somehow.
They really were great folks; they just had that one quirk, but after a while I could just chuckle about it.
Wheeler gets close to it by suggesting to locate which side of the bug you're on, but often I find myself doing this recursively until I locate it.
Bisecting is just as useful when searching for the layer of the application that has the bug (including external libraries, OS, hardware, etc.) or the data ranges that trigger the bug. There are just no handy tools like git bisect for that. So this amounts to writing down what you tested and removing the possibilities that you excluded with each test.
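Done by hand it can be as crude as this (the program and input file are hypothetical; the point is just to halve the suspect range each round and write down the result):

    head -n 5000 input.csv > half.csv && ./process half.csv   # fails? the trigger is in the first 5000 rows
    tail -n 5000 input.csv > half.csv && ./process half.csv   # fails? it's in the last 5000
    # Keep halving whichever range still fails, noting each result as you go.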
I've followed https://debugbetter.com/ for a few weeks and the content has been great!
You can’t trust a thing this person says if they’re not recommending a duck.
As someone who has been working on a debugging tool (https://undo.io) for close to two decades now, I totally agree that it's just weird how little attention debugging as a whole gets. I'm somewhat encouraged to see this topic staying near the top of hacker news for as long as it has.
AKA: “Problems that go away by themselves come back by themselves.”
Yeah, ain't nobody got time for that. If e.g. debugging a compile issue meant we read the compiler manual, we'd get nothing done...