When you write code, you can choose between per-process, per-thread, or sequential execution.
The problem is that running multiple tests in a shared space doesn't necessarily match the world in which this code is run.
Per-process testing allows you to design a test that matches the usage of your codebase. Per-thread already constrains you.
For example: we might elect to write a job as a process that runs on demand. The library we use has a memory leak that can't be fixed in a reasonable time, but since the job runs as a process that gets restarted, we manage to contain the behavior.
Running multiple tests in multiple threads might not work here, as there is a shared space that is retained and isn't representative of real-world usage.
Concurrency is a feature of your software that you need to code for. So if you have multiple things happening, then that should be part of your test harness.
The test harness forcing you to think of it isn't always a desirable trait.
That said, I have worked on a codebase where we discovered bugs because the tests were run in parallel, in a shared space.
For me that's a positive bonus. If it runs multiple times in parallel and works, it will work as a single instance deployed in a pod somewhere.
* Use multiple processes, but multiple tests per process as well.
* Randomly split and order the tests on every run, to encourage catching flakiness. Print the seed for this as part of the test results for reproducibility.
* Tag your tests a lot (this is one place where, as many languages provide, "test classes" or other grouping is very useful). Smoke tests should run before all other tests, and all run in one process (though still in random order). Known long-running tests should be tagged to use a dedicated process and mostly start early (longest first), except that a few cores should be reserved to work through the fast tests so they can fail early.
* If you need to kill a timed-out test even though other tests are still running in the same process - just kill the process anyway, and automatically run the other tests again.
* Have the harness provide fixtures like "this is a temporary directory, you don't have to worry about clearing it on failure", so tests don't have to worry about cleaning up if killed. Actually, why not just randomly kill a few tests regardless?
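Picking up that last point, here is a minimal sketch of such a temporary-directory fixture in Rust. It assumes the third-party `tempfile` crate and uses hypothetical names; the point is only that the test never has to clean up after itself, even if its process is killed.

```rust
use std::fs;
use tempfile::TempDir; // third-party crate, a common choice for scratch dirs

// Hypothetical fixture: hand the test its own scratch directory.
// On the happy path, dropping `TempDir` deletes it; if the process is
// killed, leftovers sit under the system temp dir where the harness
// (or the OS) can sweep them up later. The test itself never cleans up.
fn with_temp_dir(test: impl FnOnce(&std::path::Path)) {
    let dir = TempDir::new().expect("failed to create temp dir");
    test(dir.path());
}

#[test]
fn writes_a_file() {
    with_temp_dir(|path| {
        fs::write(path.join("out.txt"), b"hello").unwrap();
        assert!(path.join("out.txt").exists());
    });
}
```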
I wrote some more about tests here [1], but I'm not sure I'll update it any more because of Github's shitty 2FA-but-only-the-inconvenience-not-the-security requirement.
The one thing we've had to be aware of is that the execution model means there can sometimes be differences in behaviour between nextest and cargo test. Very occasionally there are tests that fail in cargo test but succeed in nextest due to better isolation. In practice this just means that we run cargo test in CI.
The behavior differences mean some projects (like wgpu, and nextest itself) only support nextest these days. There's also support for setup scripts which can be used to pre-seed databases and stuff.
I'm not actually clear what he means by 'test' to be honest, but I assume he means 'a single test function that can either pass or fail'
E.g. in Python (nose):

```python
class TestSomething:
    def test_A(self): ...
    def test_B(self): ...
```
I'm assuming he means test_A. But why not run all of TestSomething in a process?
Honestly, I think the idea of having tests have shared state is bad to begin with (for things that truly matter, eg if the outcome of your test depends on the state of sys.modules, something else is horribly wrong), so I would never make this tradeoff to benefit a scenario that I never think should be done.
Even if we were being absolute purists, this still hasn't solved the problem the second your process communicates with any other process (or server). And that problem seems largely unsolvable, short of mocking.
Basically, I'm not convinced this is a good tradeoff, because the idea of creating thousands and thousands of processes to run a test suite, even on linux, sounds like a bad idea. (And at work, would definitely be a bad idea, for performance reasons)
While enforcing no shared state in tests might be useful, that wouldn't be feasible in Rust without adding quite a lot of constraints that would be tough if not impossible to enforce in a drop-in replacement for cargo test. There's certainly room for alternatives in the testing ecosystem in Rust that don't try to maintain compatibility with the built-in test harness, but I don't think the intention of cargo nextest is to try to do that.
One other point that might not be obvious is that right now, there's no stable way to hook into Rust's libtest. The only options for providing an alternative testing harness in Rust are to either only support nightly rather than stable, break compatibility with tests written for the built-in test harness, or provide a separate harness that still supports existing tests. I'm sure there are arguments to be made for each of the other alternatives, but personally, I don't think there's any super realistic chance of adoption for anything that picks the first two options, so the approach cargo nextest is taking is the most viable one available (at least until libtest stabilizes, but it's not obvious when that will happen).
I assume so as well.
Unit testing in Rust is based around functions annotated with #[test], so it's safe to assume that when the author says "test" they are referring to one such function.
It's up to the user to decide what they do in each function. For example, you could do a Go-style table-driven test, but the entire function would be a single "test", _not_ one "test" per table entry.
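For instance, a hypothetical table-driven test in Rust (names made up for illustration): the harness, whether cargo test or nextest, sees exactly one test here, regardless of how many cases the table contains.

```rust
// One #[test] function driven by a table of cases. The harness counts
// this as a single test, not one test per table entry.
#[test]
fn parses_decimal_integers() {
    let cases = [("0", 0), ("42", 42), ("007", 7)];
    for (input, expected) in cases {
        assert_eq!(input.parse::<i32>().unwrap(), expected, "input {input:?}");
    }
}
```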
> Honestly, I think the idea of having tests have shared state is bad to begin with (for things that truly matter, eg if the outcome of your test depends on the state of sys.modules, something else is horribly wrong), so I would never make this tradeoff to benefit a scenario that I never think should be done.
I don't disagree as a matter of principle, but the reality really is different. Some of the first nextest users outside of myself and my workplace were graphical libraries and engines.
> Basically, I'm not convinced this is a good tradeoff, because the idea of creating thousands and thousands of processes to run a test suite, even on linux, sounds like a bad idea. (And at work, would definitely be a bad idea, for performance reasons)
With Rust or other native languages it really is quite fast. With Python, I agree, it's not as fast, so this tradeoff wouldn't make as much sense there.
But note that things like test cancellation are a little easier to do in an interpreted model.
I blame this partially on our notions of code reuse. We conflate it with several other things, and in the case of tests we conflate it with state reuse.
And the availability of state reuse leads to people writing fakes when they should be using mocks, and to people not being able to tell the difference between mocks and fakes, and thus being incapable of having a rational discussion about them.
To my thinking, and the thinking of pretty much all of the test experts I’ve called Mentor (or even Brother), beforeEaches should be repeatable. Add a test, it repeats one more time. Delete a test, one less. And if they’re repeatable, they don’t have to share the same heap. One heap is as good as another.
Lots of languages can only do that segregation at the process level. In NodeJS it would be isolates (workers). If you’re very careful about global state you could do it per thread. But that doesn’t happen very often because “you” includes library writers, language designers, and your coworker Steve who is addicted to in-memory caching. I can say, “don’t be Steve” until I’m blue in the face but nearly every team hires at least one Steve, and some are rotten with them.
That is especially good for bare metal. If you don't have a global allocator, have limited RAM, etc., you might not be able to write the test harness as part of the guest program at all! So you want to move as much logic to the host program as possible, and then run as little as a few instructions (!) in the guest program.
See https://github.com/gz/rust-x86 for an example of doing some of this.
Measured by weekly downloads (around 120k a week total last I checked), Windows is actually the number two platform nextest is used on, ahead of macOS. It's mostly CI, but clearly a lot of people are getting value out of nextest on Windows.
Is "memory corruption" an issue with Rust? Also, if one test segfaults, isn't it a reason to halt the run because something got seriously broken?
You can cause memory corruption if you opt out of memory safety guarantees by using Unsafe Rust.
https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html
Sometimes unsafe is necessary, and the idea then is that the “dangerous” parts of the code remain isolated in explicitly marked “unsafe” blocks, where they can be closely reviewed.
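As a minimal sketch of the kind of thing that can go wrong (my own illustration, not from the article): a raw pointer that outlives the value it points at, dereferenced inside an `unsafe` block, is undefined behavior and can corrupt memory.

```rust
fn main() {
    let dangling: *const i32;
    {
        let x = 42;
        dangling = &x; // raw pointer to a short-lived local
    } // `x` goes out of scope here, so `dangling` now points at dead stack space
    // Dereferencing a raw pointer is only allowed inside `unsafe`,
    // and this particular dereference is undefined behavior.
    let value = unsafe { *dangling };
    println!("{value}");
}
```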
Also, even if your own Rust code is doing nothing unsafe you might be using external libraries written in other languages and things might go wrong.
> if one test segfaults, isn't it a reason to halt the run because something got seriously broken?
Sometimes it’s still interesting and helpful to continue running other tests even if one fails. If several of them fail, it might even help you pinpoint what’s going wrong more precisely than a single failure would. (Although having a bunch of failing tests can also be more noise.)
https://hacks.mozilla.org/2019/11/announcing-the-bytecode-al...
For tests specifically, some considerations I found to be missing:
- Given speed requirements for tests, and representativeness requirements, it's often beneficial to refrain from too much isolation so that multiple tests can use/exercise paths that rely on pre-primed in-memory state (caches, open sockets, etc.). It's odd that the article calls out isolation from global-ish state mutation as a specific benefit of process isolation, given that it's often substantially faster and more representative of real production environments to run tests in the presence of already-primed global state. Other commenters have pointed this out.
- I wish the article were clearer about threads as an alternative isolation mechanism for sequential tests versus threads as a means of parallelizing tests. If tests really do need to be run in parallel, processes are indeed the way to go in many cases, since thread-parallel tests often test a more stringent requirement than production would. Consider, for example, a global connection pool which is primed sequentially on webserver start, before the webserver begins (maybe parallel) request servicing. That setup code doesn't need to be thread-safe, so using threads to test it in parallel may surface concurrency issues that are not realistic.
- On the other hand, enough benefits are lost when running clean-slate test-per-process that it's sometimes more appropriate to have the test harness orchestrate a series of parallel executors and schedule multiple tests to each one. Many testing frameworks support this on other platforms; I'm not as sure about Rust--my testing needs tend to be very simple (and, shamefully, my coverage of fragile code lower than it should be), so take this with a grain of salt.
- Many testing scenarios want to abort testing on the first failure, in which case processes vs. threads is largely moot. If you run your tests with a thread or otherwise-backgrounded routine that can observe a timeout, it doesn't matter whether your test harness can reliably kill the test and keep going; aborting the entire test harness (including all processes/threads involved) is sufficient in those cases.
- Debugging tools are often friendlier to in-process test code. It's usually possible to get debuggers to understand process-based test harnesses, but this isn't usually set up by default. If you want to breakpoint/debug during testing, running your tests in-process and on the main thread (with a background thread aborting the harness or auto-starting a debugger on timeout) is generally the most debugger-friendly practice. This is true on most platforms, not just Rust.
- fork() is a middle ground here as well: it can be slow (though mitigations exist), but it can also speed things up considerably by sharing e.g. primed in-memory caches and socket state with tests when they run. Given fork()'s sharp edges re: filehandle sharing, this, too, works best with sequential rather than parallel test execution. Depending on the libraries in use in code-under-test, though, this is often more trouble than it's worth. Dealing with a mixture of fork-aware and fork-unaware code is miserable; better to do as the article suggests if you find yourself in that situation. How to set up library/reusable code to hit the right balance between fork-awareness/fork-safety and environment-agnosticism is a big and complicated question with no easy answers (and also excludes the easy rejoinder of "fork is obsolete/bad/harmful; don't bother supporting it and don't use it, just read Baumann et al.!").
- In many ways, this article makes a good case for something it doesn't explicitly mention: a means of annotating/interrogating in-memory global state, like caches/lazy_static/connections, used by code under test. With such an annotation, it's relatively easy to let invocations of the test harness choose how they want to work: reuse a process for testing and re-set global state before each test, have the harness itself (rather than tests by side-effect) set up the global state, run each test with and/or without pre-primed global state and see if behavior differs, etc. Annotating such global state interactions isn't trivial, though, if third-party code is in the mix. A robust combination of annotations in first-party code and a clear place to manually observe/prime/reset-if-possible state that isn't annotated is a good harness feature to strive for. Even if you don't get 100% of the way there, incremental progress in this direction yields considerable rewards.
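A rough sketch of what that last idea could look like in Rust, with entirely hypothetical names: each piece of global state registers a reset hook, and the harness decides whether to call the hooks before each test (clean slate) or skip them (pre-primed, production-like state).

```rust
use std::sync::Mutex;

// Hypothetical registry of "annotated" global state: each global registers a
// reset hook so a shared-process harness can restore a clean slate between
// tests, or deliberately leave state primed to mimic a warmed-up server.
static RESET_HOOKS: Mutex<Vec<fn()>> = Mutex::new(Vec::new());

// One example of such a global: an in-memory query cache.
static QUERY_CACHE: Mutex<Vec<(String, u64)>> = Mutex::new(Vec::new());

fn register_reset_hook(hook: fn()) {
    RESET_HOOKS.lock().unwrap().push(hook);
}

// The harness calls this before each test when running in "clean slate" mode,
// and skips it when it wants tests to see already-primed state.
fn reset_annotated_globals() {
    for hook in RESET_HOOKS.lock().unwrap().iter() {
        hook();
    }
}

fn init_query_cache() {
    register_reset_hook(|| QUERY_CACHE.lock().unwrap().clear());
}
```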
The post lists out what it would take to make most of nextest's feature set available in a shared-process model. There has been some interest in this, but it is a lot of work for things that come for free with the process-per-test model.
I do try and present a decent level of detail in the post.