228 points by fagnerbrack 3 days ago | 15 comments
  • nostrademons 3 days ago
    There is a bunch of good advice here, but it misses the most useful principle in my experience, probably because the motivating example is too small in scope:

    The way to build reliable software systems is to have multiple independent paths to success.

    This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computers, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answers agree, great, you're done. If not, have some consensus mechanism to detect the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.
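
    To make the shape concrete, here's a toy sketch of it (Python, with hypothetical implementation functions standing in for the independent paths, and assuming answers are hashable so they can be compared; real systems use proper quorum machinery rather than this naive majority vote):

        from collections import Counter
        from concurrent.futures import ThreadPoolExecutor

        def redundant_compute(task_input, implementations, retries=2):
            # Run several independent implementations of the same task and
            # accept an answer only if a simple majority of them agree.
            for _ in range(retries + 1):
                answers = []
                with ThreadPoolExecutor(max_workers=len(implementations)) as pool:
                    futures = [pool.submit(impl, task_input) for impl in implementations]
                    for future in futures:
                        try:
                            answers.append(future.result(timeout=5.0))
                        except Exception:
                            pass  # a failed path simply loses its vote
                if answers:
                    answer, votes = Counter(answers).most_common(1)[0]
                    if votes > len(implementations) // 2:
                        return answer
            raise RuntimeError("no consensus after retries")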

    The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly - if you have 20 components connected in series that are all 98% reliable, the chance of success is only about 2/3. If instead you have n different events where any one can go right to achieve success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn). This quantity climbs quickly as the number of alternate pathways to success goes up. If you have 3 alternatives each of which has just an 80% chance of success, but any of the 3 will work, then doing them all in parallel has over a 99% chance of success.
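
    The same arithmetic in a couple of lines of Python, just to make the two formulas concrete:

        # 20 components in series, each 98% reliable -- everything must work:
        series = 0.98 ** 20             # ~0.67, i.e. only about a 2/3 chance of success

        # 3 alternatives in parallel, each 80% reliable -- any one is enough:
        parallel = 1 - (1 - 0.8) ** 3   # 0.992, i.e. over 99%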

    This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.

    • kqr 3 days ago
      I disagree somewhat, influenced by the teachings of Nancy Leveson.

      In the 1930s, yes, component redundancy was the way to reliability. This worked at the time because components were flaky and technical systems were simple aggregations of components. Today, components themselves are more reliable, but even when they are not, redundancy adds only a little reliability because there's a new, large source of failure: interactive complexity.

      Today's systems are so complicated that many failures stem from insufficient, misunderstood, or ambiguous specifications. These errors happen not because a component failed -- all components work exactly as they were intended to -- but because their intended interactions produce an unintended result. Failure is an emergent property.

      The solution is to approach reliability from a system theoretic perspective. This very early draft contains the core of the idea, but not yet fleshed out or edited: https://entropicthoughts.com/root-cause-analysis-youre-doing...

      • nostrademons 3 days ago
        This is why Erlang's OTP focuses on supervisor trees. At each level of the component hierarchy, you have redundancy. Subcomponents themselves may have interactive complexity, but a failure or misspecification in any of the interactions making up that subcomponent simply makes that subcomponent fail. This failure is handled at a higher level by doing something simpler.

        And "do something simpler" is actually a core part of this strategy. You're right that "today's systems are so complicated that many failures stem from insufficient, misunderstood, or ambiguous specifications". In most cases, yesterday's system worked just fine, you just can't sell it as a competitive advantage. So build simple, well-understood subsystems as fallbacks to the complex bleeding-edge systems, or even just take the software that's been working for a decade.

      • detourdog 3 days ago
        I have the opinion that today's very complicated systems are a symptom of over-complication for the problem at hand.

        I’m working on the idea that there is a better set of assumptions to use for directing technical development.

        • gukov 2 days ago
          Systems are not built in one go. They usually start out simple enough and become complex over time.
    • yen223 3 days ago
      In the limit, there is a hard tradeoff between efficiency and reliability.

      Failovers, redundancies, and backups are all important for building systems that are resilient in the face of problems, for reasons you've pointed out.

      However, failovers, redundancies and backups are inefficient. Solving a problem with 1 thing is always going to be more efficient than solving the same problem with 10 things.

      It's interesting to see this tradeoff play out in real life. We see people coalescing around one or two services because that's the most efficient path, and then we see them diversifying across multiple services once bad things happen to the centralised services.

      • nostrademons 3 days ago
        This is a very important point, and often misunderstood on both a business & societal level. Reliability has a cost. If you optimize all redundancy out of a system, you find that the system becomes brittle, unreliable, and prone to failure. Companies like 3M and Boeing have found that in the pursuit of higher profits, they've lost their focus on quality and suffered the resulting loss of trust and brand damage. The developed world discovered that with COVID, our just-in-time efficiency meant that any hiccup anywhere in the supply chain meant mass shortages of goods.
      • marcosdumay 3 days ago
        > In the limit, there is a hard tradeoff between efficiency and reliability.

        Yes, but notice that most things in the GP's comment have an exponential impact on reliability (well, on 1 - reliability), so they are often no-brainers as long as they follow that simple model (which they stop doing at some point).

      • hippich 3 days ago
        Imho, the problem is that it is hard to estimate the trade-offs. Optimizations (not just in computer systems, but in general) are often seen as risk-free, when in reality they are not. More often than not one will be celebrated for optimization, and rarely for resilience (dubbed duplicate, useless work).
    • marcosdumay 3 days ago
      As always, life is not that simple, and redundant components can interact in harmful ways, correctness checks can create incorrectness, process managers or consensus algorithms can amplify small problems...

      Just like those, every technique in the article can also turn out to reduce your reliability.

    • ramchip 3 days ago
      > For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways.

      Just to be clear, while this particular technique is valid and used in space software, it isn't common at all in Erlang and not part of the "let it crash" principle.

    • alexpetralia 3 days ago
      Interestingly this is exactly how I've come to define truth/correctness: https://alexpetralia.com/2023/01/25/how-do-we-know-if-data-i...
    • pistoleer 3 days ago
      Who will replicate the consensus checker?
      • the_sleaze_ 3 days ago
        Because he's the failover Gotham deserves, but not the validator it needs right now
    • amelius 3 days ago
      > The way to build reliable software systems is to have multiple independent paths to success.

      That's a heuristic that might work sometimes.

      If you really want to build reliable software systems, then at least prove them correct. There are some tools and methodologies that can help you with this. And of course even a proof isn't everything since your assumptions can still be wrong (but in more subtle ways).

    • manvillej 3 days ago
      This is the simple, basic reality of statistics: a binomial distribution.

      Five independent systems, each with a 90% chance of success, are mathematically as reliable as a single system that is 99.999% reliable.

      100x 90% systems would get you to 100 "9s" of reliability aka 99.99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999%

      • kqr 3 days ago
        Except the binomial assumption obviously does not hold because

        (a) failures are correlated, not independent, and

        (b) many failures happen not at the component level but at the plane where components interact, and regardless of how much redundancy there is at the component level, there is ultimately just one plane at which they finally interact to produce a result.

      • lucianbr 3 days ago
        Actually making 5 completely independent systems would be exceptionally hard. No shared code or team members, no shared hardware... For example, what 5 computing platforms would you use? x86, ARM, RISC-V and...?

        Math rarely applies so easily to real life. Talking about "independent" systems is cheap.

        If at all possible. How would you transport yourself to work using two independent systems?

        • nostrademons 3 days ago
          It's relatively simple at the organizational level, just expensive (but linearly expensive, whereas increasing subcomponent reliability is often exponentially expensive!). Just give the same problem statement to two independent teams with two different managers, have a clear output format and success criteria, and let them make all their technical decisions independently.

          Your example of "how do you transport yourself to work using two independent systems" is actually very apropos, because I and many other commuters do exactly that. If the highway is backed up, I bypass it with local roads. If everything is gridlock, I take public transportation. If public transportation isn't functioning (and it generally takes a natural disaster to knock out all the roads and public transportation, but natural disasters have happened), I work from home and telecommute. Each of these subsystems is less favored than the alternative, but it'll get me to work.

          • lucianbr 3 days ago
            While these are reasonable approaches, I do not think they live up to the mathematical meaning of "independent", and so they invalidate the probability calculation.

            Your two teams might well both use the same hardware or software component somewhere in the system. This will make the probabilities of failure of the two systems not completely independent, for all that you paid two teams and they worked separately. You spent a lot of money, and the results will not be as expected. If they both use x86 Intel, and a Meltdown kind of thing happens, your "independent" systems will both fail from the same cause.

            The transport analogy works great if you somehow imagine the transportation to be instantaneous, and only the decision to matter. But if you are already on a train and the train is delayed, you are not walking back home and taking the car. You have multiple options for transport, but you do not have a system built of independent components. You are not using the train and the car and the highway and the local roads all simultaneously.

            I don't think you understand the requirements for the formula you wrote to be valid. Your examples do not fit, for all that they are reasonable and useful approaches. Your actual reliability with these approaches falls way below the multiple nines you think of.

  • uzerfcwn 5 hours ago
    It seems like the author had some very specific read and write pattern in mind when they designed for performance, but it's never explicitly stated. The problem setting only stated that "reads are more common than writes", but that's not really saying much when discussing performance. For example, an HTML server commonly has a small set of items that are most frequently read, and successive reads are not very strongly dependent. On the other hand, a PIM system may often get iterative reads correlated on some fuzzy search filter, which will be slow and thrash the cache pretty badly if the system is optimized for different access patterns.

    When designing software, you first need to nail down the requirements, which I didn't really find in TFA.

  • taeric 3 days ago
    This misses one of the key things I have seen that really drives reliable software: actually rely on the software.

    It sucks, because nobody likes the idea of the "squeaky wheel getting the grease." At the same time, nobody is surprised that the yard equipment that they haven't used in a year or so is going to need effort to get back to working. The longer it has been since it was relied on to work, the more likely that it won't work.

    To that end, I'm not arguing that all things should be on the critical path. But the more code you have that isn't regularly exercised, the more likely it will be broken if anything around it changes.

    • LorenPechtel 2 days ago
      Yup. Dogfood everything you can. Too often I've seen things that could never have made it out the door if whoever designed them actually used them or worked with those who used them.
  • bruce511 3 days ago
    The first point is one that resonates strongly with me. Counter-intuitively, the first instinct of a programmer should be "buy that, don't write it"

    Of course, as a programmer, this is by far not my first instinct. I am a programmer, my function is programming, not purchasing.

    Of course buying something is always cheaper (compared to the cost of my time) and will be orders of magnitude cheaper once the costs to maintain written-by-me code is added in.

    Things that are bought -tend- to last longer too. If I leave my job I leave behind a bunch of custom code nobody wants to work on. If I leave Redis behind, well, the next guy just carries on running Redis.

    I know all this. I advocate for all this. But I'm a programmer, and coders gotta code :) So it's not like we buy everything; I'm still there, still writing.

    Hopefully though my emphasis is on adding value. Build things that others will take over one day. Keep designs clean, and code cleaner.

    And if I add one 'practice' to the list: Don't Be Clever. Clever code is hard to read, hard to understand, hard to maintain. Keep all code as simple as it can be. Reliable software is software that mostly isn't trying to be too clever.

    • VyseofArcadia 2 days ago
      I'm not sure I 100% agree.

      I've been thinking a lot lately about the cost of off-the-shelf solutions from the perspective of sustainability, and there is a cost beyond money. The performance of software almost always degrades over time. By buying Foo off-the-shelf, you are saying, "I am ok with getting on the same bloat-dictated hardware upgrade cycle as Foo."

      Of course you have the option of buying Foo and never upgrading, unless Foo has a license that forces you to. But that also walls you off from security bugfixes. But by replicating the essential features of Foo in-house, you can actually set and stick to a complexity and performance budget.

      Of course if you are a business of any real size, you're already on the hardware upgrade treadmill anyway, and probably all of your customers are too, so what does it matter if your software is a little slower and a little more resource hungry year after year after year? Other than maybe a little twinge of guilt every now and then.

      • bruce511 2 days ago
        Certainly nothing is free (not even Free Software.) So there will surely be times when building is better than buying.

        I suppose the key is to understand the hidden costs with both approaches. The salary vs subscription cost is part of it, but there are also subtle things, like flexibility (or lack thereof in bought systems), security (or lack thereof in homegrown systems) and so on.

    • aitchnyu 3 days ago
      This topic deserves an article on its own. I feel my team crossed "the line" on a SaaS that hosts docs from our OpenAPI spec, and the page doesn't even refresh safely. But how do we define the line?
      • kqr 3 days ago
        The line is where the cost of building is less than that of buying. It sounds like in your case building would have been cheaper, given the simplicity of the problem and the quality issues with the purchased solution.

        It does get difficult in more complicated cases thanks to a lack of information on what a good solution looks like. This article attempts to straighten it out a little: https://entropicthoughts.com/build-vs-buy

        • marcosdumay 3 days ago
          > The line is where the cost of building is less than that of buying.

          Yes, once you factor in transaction costs, integration costs, risks of contamination from that 3rd party, risks from lack of value alignment with that 3rd party (remember the Unity game engine?)...

          Or, in other words, people who say that phrase very often don't know the actual cost of buying. But well, nobody knows the actual cost of building before they try either.

        • zemvpferreira 3 days ago
          I would personally replace 'cost of building' with 'cost of maintaining', but otherwise agree with your reasoning. It's worth building in a factor of safety, such that I would formulate this idea as:

          Only build software if the cost of maintaining it is 1/3 or less of the cost of buying a license.

          (this has the nice second-order effect of being more robust to errors in the maintenance estimate, hence making it quicker to estimate).

    • eikenberry 2 days ago
      > Counter-intuitively, the first instinct of a programmer should be "buy that, don't write it"

      I don't think this is counter-intuitive at all... this is the whole premise behind free software. Why write it yourself when someone else already has, and there is a community around using and updating it? We all buy the vast majority of our software and it is usually our go-to move, unless there is an itch.

      • bruce511 2 days ago
        In the context of you, at home, wanting to get stuff done, I agree.

        But in the context of "you're at work paid to be a programmer" the instinct is to look for things to program.

  • l5870uoo9y 3 days ago
    I would add a ninth practice: throw errors. That way you find and fix them, as opposed to errors that go silently unnoticed in the code base.
    • AnimalMuppet 3 days ago
      Fail early, and fail noisily. Don't fail silently.
      • LorenPechtel 2 days ago
        Yes, .NET. I really love how an uncaught exception in a secondary thread simply causes a silent termination of the thread. In the development environment (C#) things work normally but a release version silently eats them.
        • neonsunset 2 days ago
          Perhaps you meant Task<T>?

              var thread = new Thread(() => {
                  Thread.Sleep(10);
                  throw new Exception("Uh oh!");
              });
              thread.Start();
          
              Thread.Sleep(100);
              Console.WriteLine("Done!");
          
          fails as expected.
          • LorenPechtel a day ago
            Yeah, I forgot exactly where the evil was.
            • neonsunset a day ago
              The "evil" is what makes .NET scale with project complexity and dependency graph size - tasks are cheap and easy to spawn. You do not want to be beholden to a third party dependency that spawns a task that ends up throwing an exception somewhere, crashing your entire application even if you don't care about it in the slightest.

              You can opt into unobserved task exceptions terminating the application if that's what you are looking for, and maybe subscribe to TaskScheduler.UnobservedTaskException event too: https://learn.microsoft.com/en-us/dotnet/api/system.threadin...

              Notably, this is an issue in Go, where a package might spawn a goroutine with an uncaught panic, like dereferencing a nil, which is common, and you have no recourse to this at all. Perhaps it did historically make sense in Go, but it continues to bite people and requires more careful vetting of the dependencies. Moreover, in type-safe, memory-safe languages an uncaught exception might be a perfectly fine thing to ignore.

              When you fire and forget a task and it ends up throwing, the GC will simply collect all the objects that no longer have GC roots, the finally blocks will be run, and finalizers will be called eventually on Gen2 GC if there are any - the standard library and most community abstractions that interact with manual memory management through interop or otherwise end up being watertight as a result of that.

  • throwawayha 3 days ago
    Great points.

    But why do we invest so much complexity into outputting html/js/css?

    • ketzo 3 days ago
      Because html/js/css is the venue for a massive fraction of human-computer interactions, and there are a lot of different things we want to accomplish between humans and computers.

      It’s always funny to me when people act like “websites” are some trivial, silly little area of software, when in fact for a lot of people, it’s their primary use of a computer.

  • SomewhatLikely 3 days ago
    My first thought upon seeing the prompt:

        If you would build an in-memory cache, how would you do it?
    
        It should have good performance and be able to hold many entries. 
        Reads are more common than writes. I know how I would do it already, 
        but I’m curious about your approach.
    
    Was to add this requirement since it comes up so often:

        Let's assume that keys accessed follow a power law, so some keys get 
        accessed very frequently and we would like them to have the fastest 
        retrieval of all.
    
    I'm not sure if there are any efficient tweaks to hash tables or b-trees that might help with this additional requirement. Obviously we could make a hash table take way more space than needed to reduce collisions, but with a decent load factor is the answer to just swap frequently accessed keys to the beginning of their probe chain? How do we know it's frequently accessed? Count-Min sketch?

    Even with that tweak, the hottest keys will still be scattered around memory. Wouldn't it be best if their entries could fit into fewer pages? So, maybe a much smaller "hot" table containing say the 1,000 most accessed keys. We still want a high load factor to maximize the use of cache pages so perhaps perfect hashing?
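
    Something like this is what I'm picturing (Python sketch, all names and thresholds made up; a real version would also age the sketch, evict from the hot table, and control memory layout so the hot table actually fits in a few pages):

        import random

        class CountMinSketch:
            # Approximate per-key access counts without storing one counter per key.
            def __init__(self, width=2048, depth=4):
                self.width = width
                self.rows = [[0] * width for _ in range(depth)]
                self.seeds = [random.randrange(1 << 32) for _ in range(depth)]

            def _buckets(self, key):
                return [(i, hash((seed, key)) % self.width) for i, seed in enumerate(self.seeds)]

            def add(self, key):
                for i, b in self._buckets(key):
                    self.rows[i][b] += 1

            def estimate(self, key):
                return min(self.rows[i][b] for i, b in self._buckets(key))

        class HotAwareCache:
            # Main table plus a small "hot" dict that the most-read keys get promoted into.
            def __init__(self, hot_size=1000, hot_threshold=32):
                self.main = {}
                self.hot = {}                     # small, so it stays in few pages
                self.hot_size = hot_size
                self.hot_threshold = hot_threshold
                self.sketch = CountMinSketch()

            def get(self, key):
                if key in self.hot:               # fast path for the power-law head
                    return self.hot[key]
                value = self.main.get(key)
                if value is not None:
                    self.sketch.add(key)
                    if (len(self.hot) < self.hot_size
                            and self.sketch.estimate(key) >= self.hot_threshold):
                        self.hot[key] = value     # promote a frequently read key
                return value

            def put(self, key, value):
                self.main[key] = value
                if key in self.hot:
                    self.hot[key] = value         # keep the hot copy consistent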

    • mannyv 3 days ago
      I've been doing this so long that my first thought was "use redis."

      Why?

      * it works

      * it's available now

      * it scales

      * it's capable of HA

      * it has bindings for every language you probably want to use

      Why bother writing your own cache, unless it's for an exercise? Cache management is complicated and error prone. Unless the roundtrip kills you, just use redis (or memcached).

    • NovaX 2 days ago
      In a typical LRU cache every read is a write in order to maintain access order. If this is a concurrent cache then those mutations would cause contention, as the skewed access distribution leads to serializing threads on atomic operations trying to maintain this ordering. The way concurrent caches work is by avoiding this work because popular items will be reordered more often, e.g. sample the requests into lossy ring buffers to replay those reorderings under a try-lock. This is what Java's Caffeine cache does for 940M reads/s using 16 threads (vs 2.3B/s for an unbounded map). At that point other system overhead, like network I/O, will dominate the profile so trying to rearrange the hash table to dynamically optimize the data layout for hot items seems unnecessary. As you suggest, one would probably be better served by using a SwissTable style approach to optimize the hash table data layout and instruction mix rather than muck with recency-aware structural adjustments.
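
      For anyone who hasn't seen the pattern, a very rough sketch of the shape (Python, not Caffeine's actual design -- the real thing uses striped ring buffers, a frequency sketch, and careful memory-model work):

          import threading
          from collections import OrderedDict

          class SampledLruCache:
              # Reads are not reordered inline; they are recorded in a lossy buffer
              # and replayed under a try-lock, so readers rarely contend on LRU order.
              def __init__(self, capacity=1024, buffer_size=128):
                  self.capacity = capacity
                  self.data = OrderedDict()          # recency order: oldest first
                  self.lock = threading.Lock()
                  self.reads = []                    # lossy: once full, reads are dropped
                  self.buffer_size = buffer_size

              def get(self, key):
                  value = self.data.get(key)         # no reordering on the read path
                  if value is not None and len(self.reads) < self.buffer_size:
                      self.reads.append(key)         # best-effort record of the access
                  self._drain_if_uncontended()
                  return value

              def put(self, key, value):
                  with self.lock:
                      self.data[key] = value
                      self.data.move_to_end(key)
                      while len(self.data) > self.capacity:
                          self.data.popitem(last=False)      # evict least recently used

              def _drain_if_uncontended(self):
                  if self.lock.acquire(blocking=False):      # try-lock: skip if busy
                      try:
                          batch, self.reads = self.reads, []
                          for key in batch:
                              if key in self.data:
                                  self.data.move_to_end(key) # replay recorded accesses
                      finally:
                          self.lock.release()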

      The fastest retrieval will be a cache hit, so really once the data structures are not the bottleneck, the focus should switch to the hit rates. That's where the Count-Min sketch, hill climbing, etc. come into play in the Java case. There's also memoization to avoid cache stampedes, efficient expiration (e.g. timing wheels), async reloads, and so on that can become important. Or if using a dedicated cache server like memcached, one has to worry about fragmentation, minimizing wasted space (to maximize usable capacity), efficient I/O, etc., because every cache server can saturate the network these days, so the goal shifts towards reducing the operational cost with stable tail latencies. What "good performance" means is actually on a spectrum because one should optimize for overall system performance rather than any individual, narrow metric.

    • withinboredom 2 days ago
      You should check out the FASTER paper from Microsoft. It specifically covers how to create a K/V log that spills to disk for older keys, but keeps recent keys in memory.
    • bespoke_engnr 3 days ago
      I think splay trees would be good for this: https://en.m.wikipedia.org/wiki/Splay_tree
  • hamdouni 2 days ago
    My takeaways, for a more general POV:

    1. Make or buy

    2. Release an MVP

    3. Keep it simple

    4. Prepare for the worst

    5. Make it easy to test

    6. Benchmark, monitor, log...

  • BillLucky 3 days ago
    Simple but elegant design principles, recommended.
  • u8_friedrich 3 days ago
    > It is much easier to add features to reliable software, than it is to add reliability to featureful software.

    Not sure about this tbh. In a lot of cases, yeah, maybe. But when you are dealing with complicated business logic where a lot of bells and whistles are required, building a simple reliable version can lead you into a naive implementation that might be reliable but very hard to extend, while making an unstable complicated thing can help you understand the pitfalls and work back from there into something more reliable. So I think this depends very much on the context.

    • jpc0 3 days ago
      How are you defining simple here?

      Simple, in my mind, has abstractions where they are needed, which should naturally lead to easy-to-extend code.

  • ActionHank 3 days ago
    Quick mental exercise on this.

    If someone posed this question to you in an interview and you used these principles, would you get the job?

    Probably not.