The goal is to model structural fragility in service architectures: retry amplification, connection pool exhaustion, latency propagation, and dependency cascades.
To demonstrate it, I ran a simulation in which a 2% latency spike triggers cascading database connection pool exhaustion across a 20-service architecture and eventually collapses the system, even though no server actually fails.
The article includes the failure breakdown as well as the simulation used to produce the results.
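
For anyone who wants a feel for the mechanism before reading the article, here is a minimal toy sketch in Python. It is not the simulation from the article: it models a single connection pool rather than 20 services, and every name and number in it (`simulate`, `pool_size`, `base_rate`, `retry_fanout`, and so on) is a made-up assumption chosen for illustration. It only shows how a 2% increase in connection hold time can push a nearly saturated pool past capacity, and how retries can keep it there after the spike ends.

```python
def simulate(spike_ticks, ticks=40, pool_size=100, base_rate=990.0,
             base_hold=0.100, spike_factor=1.02, retry_fanout=2.0):
    """Toy model of one connection pool (all parameters are hypothetical).

    Returns, for each one-second tick, the fraction of attempts that
    obtained a connection.
    """
    retries = 0.0
    success = []
    for t in range(ticks):
        # Hold time is 2% longer while the spike lasts, then back to normal.
        hold = base_hold * (spike_factor if t < spike_ticks else 1.0)
        # Little's law: a pool of N connections held for `hold` seconds each
        # sustains at most N / hold requests per second.
        capacity = pool_size / hold
        offered = base_rate + retries
        served = min(offered, capacity)
        failed = offered - served
        # Assumption: each failed attempt comes back as `retry_fanout`
        # attempts on the next tick (retries stacked across call layers).
        retries = failed * retry_fanout
        success.append(served / offered)
    return success


if __name__ == "__main__":
    for spike in (0, 1, 5):
        final = simulate(spike_ticks=spike)[-1]
        print(f"spike of {spike} tick(s): final success ratio = {final:.4f}")
```

With these assumed numbers the pool runs at 99% utilization, so a 2% longer hold time is enough to exceed capacity. A one-tick spike drains on its own, while a five-tick spike leaves enough retry traffic behind that the success ratio keeps falling even though the latency has returned to normal and nothing has crashed.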
I'm currently building a public simulation lab where engineers can design their own architectures and run failure simulations against them.
Would love feedback from people working on distributed systems, reliability engineering, or large-scale backend infrastructure.