Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.
But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.
Starting with a minimal working example is the best way I have found to fix a bug. You will find you resist this because you believe you are close and it seems too time-consuming to start from scratch. In practice, it tends to be a real time-saver, not the opposite.
So as a metaphor for software debugging, this is "throw away the code, buy a working solution from somewhere else." It may be a way to run a business, but it does not explain how to debug software.
Those cases are highly seductive: the code seems 98% of the way there, and you think it's only another 2% to get it working, but you never get to the end of that 2%. It's like pushing a bubble around under a rug. (Many of today's LLM enthusiasts, after they "pay their dues", will wind up sounding like me.)
This is documented in https://www.amazon.com/Friends-High-Places-W-Livingston/dp/0... and takes its worst form when you've got an improperly designed database that has been in production for some time and cannot be 100% correctly migrated to a correct database structure.
Take the Wikipedia article on debugging[1]. The first sentence identifies what debugging is: "In engineering, debugging is the process of finding the root cause, workarounds and possible fixes for bugs." I'd say this is implicitly about taking broken code and finding the error in it.
The second paragraph is clearer on this point:
> For software, debugging tactics can involve interactive debugging, control flow analysis, log file analysis, monitoring at the application or system level, memory dumps, and profiling. Many programming languages and software development tools also offer programs to aid in debugging, known as debuggers.
All of the debugging tactics listed are about working with an existing broken piece of code.
Further sections on debugging tools, debugging process ("debugging process normally begins with identifying the steps to reproduce the problem"), and techniques are also about working with broken code.
Completely rewriting the code is certainly a way to resume progress on a software project, but as the practice of debugging is imagined, rewriting is not debugging.
Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who can trust you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.
This also works well in conjunction with debug tooling -- the tooling gives you the raw information, writing down that information helps join the dots.
Another developer and I were tasked with fixing it. The Customer Service manager (although one of the most conniving, politically destructive assholes I have ever not-quite worked with) actually carried a crap umbrella. Instead of constantly flaming us with how many millions of dollars our outage was costing every minute, he held up that umbrella and diverted the crap. His forbearance let us focus. He discreetly approached every 20 minutes, toes not quite entering the office, calmly inquiring how it was going. In just over an hour (between his visits 3 and 4), Nate and I had the diagnosis and the fix, and had rolled it out to production, to the relief of pension funds worldwide.
As much as I dislike the memory of that manager to this day, I praise his wisdom every chance I get.
- "her job was to be a crap-umbrella": hyphenate into a compound noun, implies "an umbrella of/for crap" to clarify the intended meaning
- "her job was to be a crappy umbrella": make the adjective explicit if the intention was instead to describe an umbrella that doesn't work well
A better analogy is you’re there to turn down the noise. The team hears what they need to hear and no more.
Equally, the job of a good manager is to help escalate team concerns. But just as there's a filter stopping the shit flowing down, you have to know what to let flow up, too.
Not all issues can be fixed with a rollback though.
Later I worked for a company with a much bigger and more critical website, and the difference in tone during urgent incidents was amazing. The management made itself available for escalations and took a role in externally communicating what was going on, but besides that they just trusted us to do our jobs. We could even go get a glass of water during the incident without a VP yelling at us. I hadn't realized until that point that being calm adults was an option.
“Slow is smooth, smooth is fast.”
I know this and still get caught out sometimes…
Edit: hahaha ‘Cerium saying ~same thing: https://news.ycombinator.com/item?id=42683671
Here's a walkthrough on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
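For anyone who hasn't tried it, the whole loop is only a handful of commands. A rough sketch (the good tag and the test command are placeholders for whatever fits your repo):

    git bisect start
    git bisect bad                 # the checkout you're on is broken
    git bisect good v1.4.0         # placeholder: any commit/tag known to work
    # git now checks out a midpoint; test it, then mark it:
    git bisect good                # or: git bisect bad
    # ...repeat until git names the first bad commit...
    # or automate the search with any command that exits 0 on good, non-zero on bad:
    git bisect run make test
    git bisect reset               # go back to where you started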
The thing is, I don't even bisect that often. The discipline necessary to maintain bisectability in your source history heavily overlaps with the disciplines that prevent regressions and bugs in the first place. But when I do finally use it, it can pay for itself in literally one shot once a year, because we bring bisect out for the biggest, most mysterious bugs -- the ones that I know from experience can have devs staring at code for weeks. And while I've yet to have a bisect point at a one-line commit, I've definitely had it hand us multiple days' worth of clues in one shot.
If I were maintaining that discipline just for bisect, we might quibble over the cost/benefit, but since there are a lot of other reasons to maintain that discipline anyhow, it's a big win.
Edit to add: I will study old diffs when there is a bug, particularly for bugs that seem correlated with a new release. Asking "what has changed since this used to work?" often leads to an obvious cause or at least helps narrow where to look. Also asking the person who made those changes for help looking at the bug can be useful, as the code may be more fresh in their mind than in yours.
You’re spot on.
However, it's clearly a missing feature that Git/Mercurial can't tag diffs as "passes" or "bisectable".
This is especially annoying when you want to merge a stack of commits and the top passes all tests but the middle does not. It’s a monumental and valueless waste of time to fix the middle of the stack. But it’s required if you want to maintain bisectability.
It’s very annoying and wasteful. :(
I have only been in one place where "rebase" is used regularly, and now that I'm a little more familiar with it, I don't mind using it to bring changes from a parent branch into a working branch, as long as the working branch hasn't been pushed to origin. It still weirds me out somewhat, and I don't see why a simple merge can't just be the preferred way.
I have, however, seen "squashing" regularly (and my current position uses it as well as rebasing) -- and I don't particularly like it, because sometimes I put in notes and trials that get "lost" as the task progresses, but nonetheless might be helpful for future work. While it's often standard to delete "squashed" branches, I cannot help but think that, for history-minded folks like me, a good compromise would be to "squash and keep" -- so that the individual commits don't pollute the parent branch, while the details are kept around for anyone needing to review them.
Having said that, I've never been in a position where I felt like I need to "forcibly" push for my preferences. I just figure I might as well just "go with the flow", even if a tiny bit of me dies every time I squash or rebase something, or delete a branch upon merging!
But not linked together and those "closed" branches are mixed in with the current ones.
Instead, try out "git merge --no-ff" to merge back into master (forcing a merge commit to exist even if a fast-forward was possible) and "git log --first-parent" to only look at those merge commits. Kinda squash-like, but with all the commits still there.
I think every commit that gets merged to main should be an atomic believed-to-work thing. Not only does this make bisect way more effective, but it's a much better narrative for others to read. You should write code to be as readable by others as possible, and your git history likewise.
git bisect should operate on bisectable commits. Which may not be all commits. Git is missing information. This is, imho, a flaw in Git.
This is critical to apply in cases where the problem might not even be caused by a code commit in the repo you're bisecting!
Someone who thinks about a problem via "which tool do I want" (c.f. "git bisect helps a lot"[1]) is going to be at a huge disadvantage to someone else coming at the same decisions via "didn't this used to work?"[2]
The world is filled to the brim with tools. Trying to file away all the tools in your head just leads to madness. Embrace philosophy first.
[1] Also things like "use a time travel debugger", "enable logging", etc...
[2] e.g. "This state is illegal, where did it go wrong?", "What command are we trying to process here?"
This is one area where I've been disappointed by Rust. They cleaned up testing and dependency management by bringing them into the core tooling, but debugging is still a mess of several poorly supported cargo extensions, none of which seem to work consistently for me (no insult to their authors, who are providing something better than nothing!).
It's very rare for me to use debuggers.
Like, traffic going from A to B can turn ... complicated with VPNs and such. You kinda have source firewalls, source routing, connectivity of the source to a router, routing on the router, firewalls on the router, various VPN configs that can go wrong, and all of that on the destination side as well. There can easily be 15+ things that can cause the traffic to disappear.
That's why our runbook recommends starting troubleshooting by dumping traffic on the VPN nodes. That's a very low-effort, quick step to figure out which of the six-ish legs of the journey drops the traffic: to VPN, through VPN, to destination, back to VPN node, back through VPN, back to source. Then you realize traffic back to the VPN node disappears, and you can dig into that.
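Something like the following, run on the VPN node, usually answers "which leg eats the traffic" in a minute or two (interface names and addresses are placeholders for your environment):

    tcpdump -ni eth0 host 10.1.2.3 and port 443   # does the traffic arrive from the source at all?
    tcpdump -ni tun0 host 10.1.2.3 and port 443   # does it make it into/through the tunnel?
    # Run the same captures on the far-side VPN node to cover the return path.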
And this is a powerful concept to think through in system troubleshooting: Can I understand my system as a number of connected tubes, so that I have a simple, low-effort way to pinpoint one tube to look further into?
As another example, for many services the answer here is to look at the requests on the load balancer. This quickly isolates which services are throwing errors and blowing up requests, so you can start looking at those. Or system metrics can help: which services / servers are burning CPU, and thus doing something, and which aren't? Does that pattern make sense? Sometimes this can tell you which step in a pipeline of steps on different systems fails.
When you don't know what is breaking that specific scroll or layout somewhere in the page, you can just remove half the DOM in the dev tools and check if the problem is still there.
Rinse and repeat, it's a basic binary search.
I am often surprised that leetcode black belts are absolutely unable to apply what they learn in the real world, in code or in debugging, which always reminds me what a useless metric it is for hiring engineers.
I find myself doing this all the time now: I will temporarily add a line that causes a fatal error, to check that it's the right file (and, depending on the situation, also the right line).
Once I'm satisfied with the test, I remove the line.
Usually, in the process of reducing my error-generating code down to a simpler case, I find the bug in my logic. I've been fortunate that heisenbugs have been rare.
Once or twice, I have ended up with something to report to the devs. Generally, those were libraries (probably from sourceforge/github) with only a few hundred or fewer users that did not get a lot of testing.
Writing tests isn't free, I agree, but in this case a good chunk of the cost of writing them will have already been paid in a way.
To the extent that other concerns get in the way of the concept, like the general difficulty of testing that GUIs do what they are supposed to do, I don't blame the concept of unit testing; I blame the techs that make the testing hard.
If anything, I'd only keep those if it's hard to write them, if people push back against it (and I myself don't like them sometimes, e.g. when the goal is just to push up the coverage metric without actually testing much, which only adds test code to maintain but no real testing value...).
Like any topic there's no universal truth and lots of ways to waste time and effort, but this specifically is extremely practical and useful in a very explicit manner: just fix it once and catch it the next time before production. Massively reduce the chance one thing has to be fixed twice or more.
Some examples that come to mind are bugs to do with UI interactions, visuals/styling, external online APIs/services, gaming/simulation stuff, and asynchronous/thread code, where it can be a substantial effort to write tests for, vs fixing the bug that might just be a typo. This is really different compared to if you're testing some pure functions that only need a few inputs.
It depends on what domain you're working in, but I find people very rarely mention how much work certain kinds of test can be to write, especially if there aren't similar tests written already and you have to do a ton of setup like mocking, factories, and UI automation.
But all things considered, I think my impression still holds; maybe I should rather say they're easier to write, in a way, though not necessarily easy.
Adding tests is not easy, and you can always find excuses to not do that and instead "promise" to do it later, which almost never happens. I have seen enough to know this. Which is why I myself have put more work in writing unit tests, refactoring code and creating test mocks than anyone else in my team.
And I can't tell you how much I appreciate it when I find that this has benefitted me personally -- when I need to write a new test, often I find that I can reuse 80% if not more of the test setup and focus on the test body.
After adding the first test, it becomes much easier to add the second and third test. If you don't add the first test or put the effort into making your code actually testable, it's never going to be easy to test anything.
It's not about being lazy or making excuses, it's not free to write exhaustive tests and the resources have to come from somewhere. For MVPs, getting the first release out is going to be much more important for example.
And I can tell you I have first hand experience of an "MVP" product getting delayed, multiple times, because management realizes that nobody wants to purchase our product when they discover how buggy and unusable it is.
It's very possible to write great software without any tests at all too. It's like people forget that coders used to write assembly code without source control, CI, mocking and all that other stuff.
I've lost count of how many things I've fixed only to see:
1) It recurs because a deeper "cause" of the bug reactivated it.
2) Nobody knew I fixed something so everyone continued to operate workarounds as if the bug was still there.
I realise these are related and arguably already fall under "You didn't fix it". That said, a bit of writing up and root-cause analysis after getting to "It's fixed!" seems helpful to others.
Saying "oh its been good for awhile now" has nothing to do with breaking it in the future.
A lot of slow tests are slow because nobody has even tried to speed them up. They just wrote something that worked, probably literally years ago, that does something horrible like fully build a docker container and fully initialize a complicated database and fully do an install of the system and starts processes for everything and liberally uses "sleep"-based concurrency control and so on and so forth, which was fine when you were doing that 5 times but becomes a problem when you're trying to run it hundreds of times, and that's a problem, because we really ought to be running it hundreds of thousands or millions of times.
I would love to work on a project where we had so many well-optimized automated tests that despite their speed they were still a problem for building. I'm sure there's a few out there, but I doubt it's many.
For this to work all the regression tests must be fast, and 100% reliable. It's worth it though. If the mistake was made once, unless there's a regression test to catch it, it'll be made again at some point.
It doesn't matter how fast it is: if you're continually adding tests for every single line of code introduced, eventually it will get so slow that you'll want to prune away old tests.
The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
Things just sort of happen with wild leaps of logic. The book is actually a fantasy book with the thinnest layer of science babble on top.
"What you have to do, if you get caught in this gumption trap of value rigidity, is slow down...you're going to have to slow down anyway whether you want to or not...but slow down deliberately and go over ground that you've been over before to see if the things you thought were important were really important and to -- well -- just stare at the machine. There's nothing wrong with that. Just live with it for a while. Watch it the way you watch a line when fishing and before long, as sure as you live, you'll get a little nibble, a little fact asking in a timid, humble way if you're interested in it. That's the way the world keeps on happening. Be interested in it."
Words to live by
Maybe I'm misunderstanding, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library's details, and now that a month or two has passed, go look at the bug.
I'd be curious if there's a single programmer that follows this advice.
> read everything in depth
Is not necessarily
> first read the entire manual of the library I'm using, all 700 pages
If I have a problem with "git bisect", I can either go to Stack Overflow, try several snippets, and see what sticks, or I can go to https://git-scm.com/docs/git-bisect to get a bit deeper knowledge on the topic.
The alternative is paradropping into an unknown system with a weird bug, messing randomly with things you don't understand until the tests turn green, and then submitting a PR and hoping you didn't just make everything even worse. It's never really knowing whether your system actually works or not. While I understand that is sometimes how it goes, doing that regularly is my nightmare.
P.S. if the manual of a library you're using is 700 pages, you're probably using the wrong library.
Statement bordering on papyrophobia. (Seems that is a real phobia)
- Scope of the library is massive
- A very peculiar way of writing, with a lot of impressively unnecessary description of minute detail, such that the reader starts counting not only sheep but also breaths before reaching the end of a sentence. (I.e., unnecessarily verbose.)
- Very extensive docs describing things from various angles, including references, topic-based how-tos, and such.
(I agree that the last one is the least likely, but there is always hope)
I would be much more interested in rules that don't start with that... Like "Rules for debugging when you don't have the capacity to fully understand every part of the system."
Bisecting is a great example here. If you are bisecting, by definition you don't fully understand the system (or you would know which change caused the problem!).
Julia Evans also has a very nice zine on debugging: https://wizardzines.com/zines/debugging-guide/
> Assumption is the mother of all screwups.
1. Is this mistake somewhere else also?
2. What next bug is hidden behind this one?
3. What should I do to prevent bugs like this?
The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...
The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.
A few times I looked for a bug like "something is not happening when it should" or "this is not the expected result", when the issue was with some config file, database records, or something sent by a server.
For instance, particularly nasty are non-printable characters in text files that you don't see when you open the file.
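A few quick ways to make those visible (GNU tools; the filename is a placeholder):

    cat -A config.ini                      # show tabs as ^I, line ends as $, other control chars as ^X / M-X
    grep -nP '[^\x20-\x7E\t]' config.ini   # list lines with control bytes or non-ASCII (may also flag legitimate UTF-8)
    xxd config.ini | less                  # when in doubt, look at the raw bytes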
"simulate the failure" is sometimes useful, actually. Ask yourself "how would I implement this behavior", maybe even do it.
Also: never reason from the absence of a specific log line. The logs can be wrong (bugged) too, sometimes. If you're printf-debugging a problem around a conditional, for instance, log both branches.
Rule 1 should be: reproduce with the most minimal setup.
99% you’ll already have found the bug.
The 1% for me was a font that couldn't handle a certain combination of letters in a row: "ft" just didn't work, and that's why it made mistakes in the PDF.
No way I could ever have known that if I hadn't reproduced it down to the letter.
Just split code in half till you find what’s the exact part that goes wrong.
I'd argue that you can't effectively split something in half unless you first understand the system.
The book itself really is wonderful - the author is quite approachable and anything but dogmatic.
I just spent a whole day trying to figure out what was going on with a radio. Turns out I had tx/rx swapped. When I went to check tx/rx alignment, I misread the documentation in the same way as the first time. So I would even add "try switching things anyway" to the list. If you have solid (but wrong) reasoning for why you did something, then you won't see the error later even if it's right in front of you.
It's useful to get the poster and make sure everyone knows the rules.
10) Enable frame pointers [1].
[1] The return of the frame pointers:
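Concretely, that's a compiler switch rather than a code change. A sketch for gcc/clang and rustc (the file and binary names are placeholders):

    gcc -O2 -g -fno-omit-frame-pointer -o myprog main.c           # keep the frame pointer even when optimizing
    RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
    # Profilers and stack-walkers (perf, etc.) can then unwind stacks reliably.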
1. set up the dev environment
2. fork/clone the code
3. create a new branch to make changes and tests
4. use the list to try to find the root cause
5. create a pull request if you think you have fixed the bug
And use Rule 0 from GuB-42: Don't panic
(edited for line breaks)
"Dave was asked as the author of Debugging to create a list of 5 books he would recommend to fans, and came up with this.
https://shepherd.com/best-books/to-give-engineers-new-perspe..."
Unfortunately, I have found many times that this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.
I know it is the best route; I do know the system (maybe I wrote it), and yet time and again I don't take the time to read what I should… I make assumptions in hopes of speeding up the process/fix, and I cost myself time…
I wish this were true, and maybe it was in 2004, but when you've got noise coming in from the cloud provider and noise coming in from all of your vendors I think it's actually quite likely that you'll see a failure once and never again.
I know I've fixed things for people without asking if they ever noticed it was broken, and I'm sure people are doing that to me also.
From my experience, this is the single most important part of the process. Once you keep in mind that nothing paranormal ever happens in systems and everything has an explanation, it is your job to find the reason for things, not guess them.
I tell my team: just put your brain aside and start following the flow of events, checking the data, and eventually you will find where things mismatch.
I don't miss working there.
Don't be too embarrassed to scatter debug log messages in the code. It helps.
My second rule:
Don't forget to remove them when you're done.
The nicest log package I had would always count the number of times a log message was hit, even if the debug level meant nothing happened. The C preprocessor made this easy; I haven't been able to find a short way to do this counting efficiently in other languages.
For intermediate representations this is better than printf to stdout
Thank you for introducing me to this book.
One of my favorite rules of debugging is to read the code in plain language. If the words don't make sense somewhere, you have found the problem or part of it.
I have found that 90% of network problems are bad cables.
That's not an exaggeration. Most IT folks I know, throw out ethernet cables immediately. They don't bother testing them. They just toss 'em in the trash, and break a new one out of the package.
Each comment: "..and this is my 10th rule: <insert witty rule>"
Total number of rules when reaching the end of the post: 9 + n + n * m, with n being number of users commenting, m being the number of users not posting but still mentally commenting on the other users' comments.
Fortunately, my efforts with Dave weren't for naught: as part of testing our own ideas on the subject, I gave a series of presentations from ~2015 to ~2017 that described our thinking. A talk that pulls many of these together is my GOTO Chicago talk in 2017, on debugging production systems.[0] That talk doesn't incorporate all of our thinking, but I think it gets to a lot of it -- and I do think it stands at a bit of contrast to Wheeler's work.
On an unrelated note, one of the folks down there explained the DNS setup once and it was like something out of a Stephen King novel. They'd even been told by a recognized industry expert (whose name I sadly can't remember any more) that what they needed to do was impossible, but they still did it. Somehow.
They really were great folks; they just had that one quirk, but after a while I could just chuckle about it.
Wheeler gets close to it by suggesting to locate which side of the bug you're on, but often I find myself doing this recursively until I locate it.
Bisecting is just as useful when searching for the layer of the application that has the bug (including external libraries, OS, hardware, etc.) or the data ranges that trigger the bug. There are just no handy tools like git bisect for that. So this amounts to writing down what you tested and removing the possibilities that you excluded with each test.
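Done by hand it can be as crude as this (the program and input file are hypothetical; the point is just to halve the suspect range each round and write down the result):

    head -n 5000 input.csv > half.csv && ./process half.csv   # fails? the trigger is in the first 5000 rows
    tail -n 5000 input.csv > half.csv && ./process half.csv   # fails? it's in the last 5000
    # Keep halving whichever range still fails, noting each result as you go.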
I've followed https://debugbetter.com/ for a few weeks and the content has been great!
You can’t trust a thing this person says if they’re not recommending a duck.
As someone who has been working on a debugging tool (https://undo.io) for close to two decades now, I totally agree that it's just weird how little attention debugging as a whole gets. I'm somewhat encouraged to see this topic staying near the top of hacker news for as long as it has.
AKA: “Problems that go away by themselves come back by themselves.”
Yeah, ain't nobody got time for that. If e.g. debugging a compile issue meant we read the compiler manual, we'd get nothing done...