> Once my property tests were running overnight
This was happening because I was setting my test runner to say "Keep generating random data and running tests for 8 hours, no matter how many tests that is", not because the tests were slow. I was running literally millions of tests overnight (usually about 2-3 million), though later when I added some slow tests that number went down into the tens of thousands of tests overnight.
I may have given the impression that property-based testing is slow. It doesn't have to be. It can be very, very quick. But the more random tests you run, the more likely you are to uncover the really, REALLY bizarre edge cases.
We had some code that used a square root, and in some very rare circumstances, we could get a negative number, which would throw an exception. I don't think I would have even considered that possibility if FsCheck hadn't generated it.
Quickly addressing the time it takes to run property-based tests, especially in Python: it's extremely helpful to make sure that running subsets of unit tests is convenient and documented and all your developers are doing it. Otherwise they may revolt against property-based testing because of the time it takes to run tests. Again, especially in Python.
The drawback is that you might get used to saying "Well, of course the tests failed, I'm in the middle of refactoring and haven't finished yet" and ignoring the failed-test notifications. But just as I ignore typecheck errors while I'm still typing and then check whether they're still there once I'm done, you probably won't end up ignoring test failures all the time. Though having a five-minute lag time between "I'm finished typing, now those test errors should go away" and having them actually go away might be disconcerting.
Sadly I have a really hard time getting teammates to agree to using property-based testing - or letting me use it - because they take "no non-deterministic tests" as ironclad dogma without really understanding the principle's real intent.
(I can do it to find edge cases to convert to deterministic unit tests in the privacy of my own home, of course. But not being able to commit a library of useful strategies without people calling foul in code review is obnoxious.)
These tests can be made deterministic. Shrinking to the minimal reproducing case is always deterministic. You can also [re]start the tests with a specific seed - the seed is typically logged, so you can replay the exact failure deterministically. Property-based tests are stochastic by design but reproducible by capture.
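For illustration, here's roughly what those reproducibility knobs look like in Hypothesis; `process` and the seed value are made up, and in practice the failure database replays known failures for you without any of this:

    from hypothesis import example, given, seed, strategies as st

    def process(s):
        # stand-in for the real code under test
        return s.strip().lower()

    @seed(20231207)                      # fix the RNG so the whole run is repeatable
    @given(st.text())
    @example("input from a bug report")  # always re-check a known troublesome input
    def test_process_never_crashes(s):
        process(s)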
For my implementation, I was using nodes of size 4 rather than size 32, because that made testing much easier; then after testing, I would switch the branching constant to 32 and run final tests. With a branching factor of 4, that means that a tree of size 24 contained one root node with two children, a left child that was a full tree of 16 (a tree node pointing to four leaf nodes, all of them full of 4 items), and a right child that was a leaf node of size 4. That makes 20 items in the tree, plus a full tail array of 4 making 24 items. The failure came when the next item was appended.
Now, if you've never implemented an RRB tree you won't have noticed the mistake. But in the previous paragraph, when I said that the right child of the root was a leaf? That was the mistake; in an RRB tree all leaves are supposed to be at the same level, and the right child of the root should have been a one-child node, whose child was a leaf of size 4. My code that promoted the root up one level when it was full (when the tree was size 20, a full 16-item tree plus a tail of 4) was faulty, and had created a new root with the previous root as the left child and the previous tail as the right child, instead of creating a one-child node as the right child of the new root with the 4-item leaf as its child. The tree still functioned for all read-only purposes (traversal, lookups, etc) and even for appends, as long as the appends did no more than fill the tail. Once the tail was full (making a total of 24 items stored in the structure) and I needed to append a 25th item, the tail-promotion code found an illegal tree state and did the wrong thing. I no longer remember which wrong thing it did, just that once that 25th item was appended, the resulting tree no longer passed all the property checks.
* Constant for all practical purposes. In fact it's O(log32 N), but since log32 N is no more than 7 when N is 2^32, the operation is bounded by a constant. So it's O(1) for all practical purposes: most lists of any reasonable size will translate into trees whose height is no more than 2 or 3, which means only 2 or 3 nodes will need to be split when splitting the entire list (and the RRB tree structure that models the list) into two parts. And likewise for joining two RRB structures together (a list append operation): there might be up to O(H) merged nodes, where H is the height of the tree, which again is no more than log32 N for a tree of branching factor 32.
So rather than try to learn two black boxes at the same time, I fall back to "several more unit tests to document more edge cases to defensibly guard against".
Is there some simple way to describe this defensive programming iteration pattern in Hypothesis? Normally we just null-check and return early and have to deal with the early-return case. How do I quickly write property tests to check that my code handles the most obvious edge cases?
> [...] time to learn a DSL for describing all possible inputs and outputs when I already had an existing function [...]
You don't have to describe all possible inputs and outputs. Even just being able to describe some classes of inputs can be useful.
As a really simple example: many example-based tests have some values that are arbitrary and the test shouldn't care about them, like e.g. employee names when you are populating a database or whatever. Instead of just hard-coding 'foo' and 'bar', you can have hypothesis create arbitrary values there.
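A minimal sketch of that idea, with a toy in-memory store standing in for the real persistence layer (the helper names here are invented):

    from hypothesis import given, strategies as st

    _db = {}

    def save_employee(emp_id, name):
        # stand-in for whatever persistence layer you actually have
        _db[emp_id] = name

    def load_employee(emp_id):
        return _db[emp_id]

    # the name is a don't-care value: instead of hard-coding 'foo',
    # let Hypothesis pick arbitrary (often weird) strings
    @given(st.integers(min_value=1), st.text(min_size=1))
    def test_roundtrip_preserves_name(emp_id, name):
        save_employee(emp_id, name)
        assert load_employee(emp_id) == name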
Just like learning how to write (unit) testable code is a skill that needs to be learned, learning how to write property-testable code is also a skill that needs practice.
What's less obvious: retro-fitting property-based tests on an existing codebase with existing example-based tests is almost a separate skill. It's harder than writing your code with property-based tests in mind.
---
Some common properties to test (a quick sketch of a couple of these follows the list):
* Your code doesn't crash on random inputs (or only throws a short whitelist of allowed exceptions).
* Applying a specific functionality should be idempotent, i.e. doing that operation multiple times should give the same results as applying it only once.
* Order of input doesn't matter (for some functionality).
* Testing your prod implementation against a simpler implementation, that's perhaps too slow for prod or only works on a restricted subset of the real problem. The reference implementation doesn't even have to be simpler: just having a different approach is often enough.
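For instance, a minimal sketch of the idempotence and order-independence properties, with a made-up `normalize` function as the code under test:

    from hypothesis import given, strategies as st

    def normalize(xs):
        # stand-in for whatever operation is supposed to be idempotent
        return sorted(set(xs))

    @given(st.lists(st.integers()))
    def test_normalize_is_idempotent(xs):
        once = normalize(xs)
        assert normalize(once) == once

    @given(st.lists(st.integers()))
    def test_order_of_input_does_not_matter(xs):
        assert normalize(list(reversed(xs))) == normalize(xs)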
The randomness can bite a little if that test failure happens on an unrelated branch, but it’s not much different to someone just discovering a bug.
edit - here's the relevant part of the hypothesis guide https://hypothesis.readthedocs.io/en/latest/tutorial/replayi...
You shouldn't think of Hypothesis as a random input generator but as an abstraction over thinking about the input cases. It's not perfect: you'll often need to .map() to get the distribution to reflect the usage of the interface being tested and that requires some knowledge of the shrinking behaviour. However, I was really surprised how easy it was to use.
So it might as well just throw the kitchen sink at the function. If it handles that, great; if not, that string will get narrowed down until you arrive at a minimal set of failing inputs.
If you just naively treat it as a string and let hypothesis generate values, sure. Which is better than if you are doing traditional explicit unit testing and haven’t explicitly defined apostrophes as a concern.
If you do have it (or special characters more generally) as a concern, that changes how you specify your test.
Now, as for what would actually happen in that situation, when using Hypothesis:
- As others have indicated, Hypothesis keeps a database of the failures it's found. Committing that database to your project repo would act like a ratchet: once they've successfully found a problem, they will continue to do so (until it's fixed).
- If you don't want to commit your database (to avoid churn/conflicts), then the same would happen per working-copy. This is fine for dev machines, where repo directories can live for a long time; though not so useful for CI, if the working-copy gets nuked after each run.
- Even if we disable the failure database, property failures will always show the inputs which caused it (the "counterexample"), along with the random seed used to generate it. Either of those is enough to reproduce that exact run, giving us a ready-made regression test we can copy/paste to make the failure permanent (until it's fixed). As others have said, Hypothesis properties can be decorated with `@example(foo)` to ensure a particular input is always checked.
- Even if we disable the failure database, and don't copy the counterexample as a regression test, property checkers like Hypothesis will still "shrink" any counterexamples they find. Your example of apostrophes causing problems is trivial for Hypothesis to shrink, so if/when it happens to find some counterexample, it will always manage to shrink that down to the string "'"; essentially telling us that "employee names fail on apostrophe", which we could either fix right now, or stick in our bug tracker, or whatever. (A sketch of pinning that counterexample with `@example` follows this list.)
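Roughly what pinning that shrunken counterexample looks like; `create_employee` is an invented stand-in for the code under test:

    from hypothesis import example, given, strategies as st

    def create_employee(name):
        # stand-in for the real code; imagine it builds SQL, escapes, etc.
        return {"name": name}

    @given(st.text(min_size=1))
    @example("'")   # the shrunken counterexample, re-checked on every run from now on
    def test_employee_names_are_handled(name):
        assert create_employee(name)["name"] == name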
In the former, hypothesis and other similar frameworks are deterministic and will replay the failing test on request or remember the failing tests in a file to rerun in the future to catch regressions.
In the latter, you just tell the framework to not generate such values or at least to skip those test cases (better to not generate in terms of testing performance).
I think the answer to this is, in practice, it will not fail to generate such input. My understanding is that it's pretty good at mutating input to cover a large amount of surface area with as few examples as possible.
But by default you also start with a new random seed every time you run the tests, so you can build up more confidence over the older tests and older code, even if you haven't done anything specifically to address this problem.
Also, even with Hypothesis you can and should still write specific tests or even just specific generators to cover specific classes of corners cases you are worried about in more detail.
Is it common practice to use the same seed and run a ton of tests until you're satisfied it tested it thoroughly?
Because I think I would prefer that. With non-deterministic tests I would always wonder if it's going to fail randomly after the code is already in production.
Running all of these tests would take too long. So instead we take a finite sample from our universe, and only run these.
> With non-deterministic tests I would always wonder if it's going to fail randomly after the code is already in production.
You could always take the same sample, of course. But that means you only ever explore a very small fraction of that universe. So it's more likely you miss something. Remember: closing your eyes doesn't make the tiger go away.
If there are important cases you want to have checked every time, you can use the @example decorator in Hypothesis, or you can just write a traditional example based test.
if you didn't use property-based testing, what are the odds you would've thought of the case?
https://fsharpforfunandprofit.com/series/property-based-test...
A more complex kind of PBT is if you have two implementations of an algorithm or data structure, one that's fast but tricky and the other one slow but easy to verify. (Say, quick sort vs bubble sort.) Generate data or operations randomly and ensure the results are the same.
Testing that f(g(x)) == x for all x and some f and g that are supposed to be inverses of each other is a good test, but it's probably not the simplest.
The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly.
For things like sorting, testing against an oracle is great. But even when you don't have an oracle, there's lots of other possibilities (a quick sketch of a couple of these follows the list):
* Test that sorting twice has the same effect as sorting once.
* Start with a known already in-order input like [1, 2, 3, ..., n]; shuffle it, and then check that your sorting algorithm re-creates the original.
* Check that the output of your sorting algorithm is in-order.
* Check that input and output of your sorting algorithm have the same elements in the same multiplicity. (If you don't already have a datastructure / algorithm that does this efficiently, you can probe it with more randomness: create a random input (say a list of numbers), pick a random number X, count how many times X appears in your list (via a linear scan); then check that you get the same count after sorting.)
* Check that permuting your input doesn't make a difference.
* Etc.
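A minimal sketch of the first and fourth of those checks, with Python's built-in sorted standing in for the sort implementation under test:

    from collections import Counter
    from hypothesis import given, strategies as st

    my_sort = sorted   # stand-in for the sort implementation under test

    @given(st.lists(st.integers()))
    def test_sorting_twice_equals_sorting_once(xs):
        assert my_sort(my_sort(xs)) == my_sort(xs)

    @given(st.lists(st.integers()))
    def test_sort_preserves_multiplicities(xs):
        assert Counter(my_sort(xs)) == Counter(xs)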
I got started out of curiosity, and because writing property based tests is a lot more fun than writing example based tests.
For this use case, we've found it best to just use a fuzzer, and work off the tracebacks.
That being said, we have used hypothesis to test data validation and normalizing code to decent success. We use on a one-off basis, when starting something new or making a big change. We don't run these tests everyday.
Also, I don't like how hypothesis integrates much better with pytest than unittest.
Yes, if that was the only property you want to test, a dedicated fuzzer is probably better.
But if you are writing a bunch of property based tests, you can add a basic "this doesn't crash in unexpected ways" property nearly for free.
So if you already have parametrized tests, you're already halfway there.
...and https://github.com/magic-wormhole/magic-wormhole/blob/1b4732...
The simplest ones to get started with are "strings", IMO, and they also give you lots of mileage (because it'll definitely test some weird unicode). So, somewhere in your API where you take some user-entered strings -- even something "open ended" like "a name" -- you can make use of Hypothesis to try a few things. This has definitely uncovered unicode bugs for me.
Some more complex things can be made with some custom strategies. The most-Hypothesis-heavy tests I've personally worked with are from Magic Folder strategies: https://github.com/tahoe-lafs/magic-folder/blob/main/src/mag...
The only real downside is that a Hypothesis-heavy test-suite like the above can take a while to run (but you can instruct it to only produce one example per test). Obviously, one example per test won't catch everything, but it is way faster when developing, and Hypothesis remembers "bad" examples, so if you occasionally do a longer run it'll re-try things that caused errors before.
Whenever I find myself thinking "WTF? Surely ABC does XYZ?", and the code for ABC isn't immediately obvious, then I'll bang-out an `ABC_does_XYZ` property and see if I'm wrong. This can be much faster than trying to think up "good" examples to check, especially when I'm not familiar with the domain model, and the relevant values would be giant nested things. I'll let the computer have a go first.
If you find yourself writing several edge cases manually with a common test logic, I think the @example decorator in Hypothesis is a quick way to do that: https://hypothesis.readthedocs.io/en/latest/reference/api.ht...
Appreciate you taking the time to answer.
That not only tests your code, but also exercises your just written input generators.
I started off with TDD to cover the basics. I learned what I needed to learn about the files I'm dealing with, edge cases, and sqlglot, and then I moved on to Hypothesis for extra confidence.
I'm curious to see if it'll help with commands for APIs. If nothing else, it'll help me appreciate how liberal my API is when perhaps I don't want it to be.
YMMV, but once I first set up Schemathesis on one of my MVPs, it took several hours to iron out the kinks. But thereafter, if built into CI, I've found it guards against regression quite effectively.
This looks really good. I like using hypothesis for APIs so this is a really nice layer on top, thanks for pointing this one out.
also pretty happy to see the reports of thousands of API calls that matched the expected responses, that increases the confidence you have in your system
Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.
Obviously it's just a tiny example, but in general I often find writing a test that will cover every hypothetical case isn't realistic.
Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?
- that the output sequence is the same length as the input
- that the output sequence is sorted
- that the population counts are the same in the input and the output
> In real life, if you had that you probably wouldn't have written the function "my_sort" at all.
Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate.
The opposite might also be true. Suppose you already have a specialised implementation, and you want to write a new generalised one. You can still test them against each other.
Eg suppose you are writing a library that supports sorting crazy huge datasets that don't fit into memory. You can still check that it gives the same answers as the built-in sorting algorithm from the standard library on input that's small enough to fit into memory.
1. d(x, x) = 0 for all x
2. d(x, y) >= 0 for all x, y
3. d(x, y) = d(y, x) for all x, y
4. d(x, z) <= d(x, y) + d(y, z) for all x, y, z (the triangle inequality)
Especially the triangle inequality check weeded out some tricky corner cases I probably wouldn't have figured out on my own. Some will object that you're not guaranteed to find bugs with this kind of random-generation strategy, but if you blast through a few thousand cases every time you run your test suite, and the odd overnight run testing a few million, you quickly get fairly confident that the properties you test actually hold. Of course any counterexamples the PBT finds should get lifted to regression tests in addition, to make sure they're always caught if they crop up again. And as with any testing approach, there are no guarantees, but it adds a nice layer of defense in depth IMO.
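As a sketch of what that looks like in Hypothesis (Euclidean distance on 2D points stands in here for whatever metric is under test, and the floating-point slack is an assumption of this example):

    import math
    from hypothesis import given, strategies as st

    def d(p, q):
        # stand-in for the distance function under test
        return math.dist(p, q)

    points = st.tuples(st.floats(-1e3, 1e3), st.floats(-1e3, 1e3))

    @given(points, points, points)
    def test_metric_axioms(x, y, z):
        assert d(x, x) == 0
        assert d(x, y) >= 0
        assert d(x, y) == d(y, x)
        # triangle inequality, with a little slack for floating point error
        assert d(x, z) <= d(x, y) + d(y, z) + 1e-9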
One example would be when you have a naive implementation of some algorithm and you want to introduce a second one, optimized but with much more complex implementation. Then this naive one will act as a model for comparisons.
Another case that comes to mind is when you have rather simple properties to test (like: does it finish without a crash, within a given time?, does not cross some boundaries on the output?), and want to easily run over a sensible set of varying inputs.
You don't need that, that’s just a convenient shortcut when you do have a function that does that (but, I would argue, a not-great example for precisely that reason.)
All you actually need is the ability to verify the property (or set of properties) that you are testing in the output. You could replace the test in that sample with something like this, which I think is a better illustration of what is going on (this is somewhat simplified; rather than the length test, you really want to test that it has the same set of items):
    # NaN has no total order, so exclude it from the generated floats
    @given(st.lists(st.integers() | st.floats(allow_nan=False)))
    def test_sort_correct(lst):
        result = my_sort(lst)
        # result has the same unique items as the original list
        assert set(result) == set(lst)
        # for each item, result has the same number of copies as the original
        assert all(
            result.count(item) == lst.count(item)
            for item in set(result)
        )
        # result is sorted: each item is less than or equal to its successor
        assert all(result[n] <= result[n+1] for n in range(len(result)-1))

With that last check that each item is less than or equal to its successor, you have the fundamental sort properties. You might have more, like stability.
Do we? You can pop-count the two lists and check that those are equal.
I'd just changed a data structure with three components, and made sure the test suite was still passing. I happened to notice a function in the same file, for parsing strings into that data structure, which had a docstring saying that it ignores whitespace at the start/end of the string, and in-between the components. It had tests for the happy-path, like "foo123x" -> ("foo", 123, 'x'), as well as checking that optional components could be left out, and it even checked some failure cases. Yet none of the tests used any whitespace.
I thought it would be good to test that, given that somebody had gone to the effort of documenting it. Yet I didn't want to write a bunch of combinations like " foo123x", "foo 123x", "foo123 x", " foo 123x", "foo 123 x", and so on. Instead, I wrote a property which adds some amount of whitespace (possibly none) to each of those places, and asserts that it gets the same result as with no whitespace (regardless of whether it's a successful parse or not). I wasn't using Python, but it was something like this:
    from hypothesis import given, strategies as st

    # `parser` is the function under test; the booleans toggle optional components
    # and the integers control how much whitespace is inserted at each position
    @given(st.booleans(), st.booleans(), st.booleans(),
           st.integers(0, 3), st.integers(0, 3), st.integers(0, 3), st.integers(0, 3))
    def whitespace_is_ignored(b1, b2, b3, s1, s2, s3, s4):
        v1 = "foo" if b1 else ""
        v2 = "123" if b2 else ""
        v3 = "x" if b3 else ""
        spaces = lambda n: " " * n
        spaced = "".join([spaces(s1), v1, spaces(s2), v2, spaces(s3), v3, spaces(s4)])
        assert parser(v1 + v2 + v3) == parser(spaced)
The property-checker immediately found that "foo123x " (with two spaces at the end) will fail to parse. When I fixed that, it found that spaces after the first component will end up in the result, like "foo 123x" -> ("foo ", 123, 'x').

Of course, we could make this property more general (e.g. by taking the components as inputs, instead of hard-coding those particular values); but this was really quick to write, and managed to find multiple issues!
If I had written a bunch of explicit examples instead, then it's pretty likely I would have found the "foo 123x" issue; but I don't think I would have bothered writing combinations with multiple consecutive spaces, and hence would not have found the "foo123x " issue.
> this means you need a second function that acts the same as the function you wrote.
Actually no, you need to check that outcomes match certain properties. The example there works for a reimplementation (for all inputs, the output of the existing function and new one must be identical) but you can break down a sorting function into other properties.
Here are some for a sorting function that sorts list X into list Y.
Lists X and Y should be the same length.
All of the elements in list X should be in list Y.
Y[n] <= Y[n+m] for n+m<len(X)
If you support a reverse parameter, iterating backwards over the sorted list should match the reverse sorted list.
Sorting functions are a good simple case but we don't write them a lot and so it can be hard to take that idea and implement it elsewhere.
Here's another example:
In the app, focus ELEMENT A. Press tab X times. Press shift-tab X times. Focus should be on ELEMENT A.
In an app with two or more elements, focus ELEMENT A. Press tab. The focus should have changed.
In an app with N elements. Focus ELEMENT A. Press tab N times. The focus should be on ELEMENT A.
Those are simple properties that should be true, are easy to test, and maybe you'd have some specific set of tests for certain cases and numbers, but there's a bigger chance you then miss that focus gets trapped in some part of your UI.
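A sketch of the first of those properties, driven against a toy focus model (the model here is an invented stand-in for a real app driven through its UI-automation API):

    from hypothesis import given, strategies as st

    class ToyApp:
        # stand-in for a real app driven through its UI-automation API
        def __init__(self, n_elements, focused):
            self.n = n_elements
            self.focused = focused
        def tab(self):
            self.focused = (self.focused + 1) % self.n
        def shift_tab(self):
            self.focused = (self.focused - 1) % self.n

    @given(st.integers(min_value=1, max_value=20), st.data())
    def test_tab_then_shift_tab_returns_focus(n, data):
        start = data.draw(st.integers(min_value=0, max_value=n - 1))
        presses = data.draw(st.integers(min_value=0, max_value=100))
        app = ToyApp(n, start)
        for _ in range(presses):
            app.tab()
        for _ in range(presses):
            app.shift_tab()
        assert app.focused == start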
I had a library for making UIs for TVs that could be shaped in any way (so a carousel with 2 vertical elements then one larger one, then two again for example). I had a test that for a UI, if you press a direction and the focus changes, then pressing the other direction you should go back to where you were. Extending that, I had that same test run but combined with "for any sequence of UI construction API calls...". This single test found a lot of bugs in edge cases. But it was a very simple statement about using the navigation.
I had a similar test - no matter what API calls are made to change the UI, either there are no elements on screen, or one and only one has focus. This was defined in the spec, but we also had defined in the spec that if there was focus, removing everything and then adding in items meant nothing should have focus. These parts of the spec were independently tested with unit tests, and nobody spotted the contradiction until we added a single test that checked the first part automatically.
This I find exceptionally powerful when you have APIs. For any way people might use them, some basic things hold true.
I've had cases where we were identifying parts of text and things like "given a random string, adding ' thethingtofind ' results in getting at least a match for ' thethingtofind ', with start and end points, and when we extract that part of the string we get ' thethingtofind '". Sounds basic, right? Almost pointless to add when you've seen the output work just fine and you have explicit test cases for this? It failed. At one step we lowercased things for normalisation and we took the positions from this lowercased string. If there was a Turkish dotted capital I (İ) in the string, it became two characters when lowercased:
>>> len("İ") == len("İ".lower())
False
So anything where this appeared before the match string would throw off the positions.

Have a look at your tests, I imagine you have some tests like
"Doing X results in Y" and "Doing X twice results in Y happening once". This is two tests expressing "Doing X n times where n>=1, Y happens". This is much more explicitly describing the actual property you want to test as well.
You probably have some tests where you parameterise a set of inputs but actually check the same output - those could be randomly generated inputs testing a wider variety of cases.
I'd also suggest that if you have a library or application and you cannot give a general statement that always holds true, it's probably quite hard to use.
Worst case, point hypothesis at an HTTP API, have it auto generate some tests that say "no matter what I send to this it doesn't return a 500".
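A rough sketch of that "never a 500" property with plain requests; the URL and payload shape are invented for illustration, and a real setup would more likely point Schemathesis (mentioned elsewhere in the thread) at an OpenAPI schema:

    import requests
    from hypothesis import given, settings, strategies as st

    json_values = st.none() | st.booleans() | st.integers() | st.text()
    payloads = st.dictionaries(st.text(), json_values)

    @settings(deadline=None, max_examples=25)  # network calls are slow
    @given(payloads)
    def test_create_item_never_500s(payload):
        # http://localhost:8000/items is a made-up endpoint for illustration
        response = requests.post("http://localhost:8000/items", json=payload)
        assert response.status_code < 500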
In summary (and no this isn't LLM generated):
1. Don't test every case.
2. Don't check the exact output, check something about the output.
3. Ask where you have general things that are true. How would you explain something to a user without a very specific setup (you can't do X more than once, you always get back Y, it's never more than you put in Z...)
4. See if you have numbers in tests where the specific number isn't relevant.
5. See if you have inputs in your tests where the input is a developer putting random things in (sets of short passwords, whatever)
One large testing loop is testing CL's compile function. This involves creating random valid lambda expressions (unnamed functions), compiling them, and running them on random input. The properties tested are things like "the compiler doesn't crash, the generated code doesn't crash, different versions of the code with different valid declarations compute the same thing".
More specific properties are tested for specific CL functions. For example, CL has a function subtypep which, given two types, says either the first value is or is not a subtype of the second, or that it couldn't figure it out. One can test this on randomly generated types, evaluating properties like (subtypep T1 T2) should be consistent with (subtypep `(not ,T2) `(not ,T1)).
The point of PBT isn't that you write tests to cover every possible case, it's that PBT will cook up tests for all sorts of cases you will never come up with yourself. I have repeatedly found it kicking out the whackiest test inputs that happen to trigger bugs from strange interactions.
A sort itself is a good example though. You can do much better than just comparing to an existing sort implementation, because you can easily check if an array is sorted with a linear scan. If your sort is stable, you can do an additional check that for all pairs of adjacent elements that compare as equal, their order is the same as in the input array.
I’ve never used PBT and failed to find a new bug. I recommended it in a job interview, they used it and discovered a pretty clear bug on their first test. It’s really powerful.
Why is it not more popular?
My theory is that only code written in functional languages has complex properties you can actually test.
In imperative programs, you might have a few utils that are appropriate for property testing - things like to_title_case(str) - but the bulk of program logic can only be tested imperatively with extensive mocking.
1. Lots of devs just don't know about it.
2. It's easy to do it wrong and try and reimplement the thing you're testing.
3. It's easy to try and redo all your testing like that and fail and then give up.
Here's some I've used:
Tabbing changes focus for UIs with more than 1 element. Shift tabbing the same number of times takes you back to where you came from.
This one on TVs with u/d/l/r nav -> if pressing a direction changes focus, pressing the opposite direction takes you back to where you came from.
An extension of the last -> regardless of the set of API calls used to make the user interface, the same is true.
When finding text ABC in a larger string and getting back `Match(x, start, end)`, if I take the string and chop out string[start:end] then I get back exactly ABC. This failed because of a dotted I that when lowercased during a normalisation step resulted in two characters - so all the positions were shifted. Hypothesis found this and was able to give me a test like "find x in 'İx' -> fail".
No input to the API should result in a 500 error. N PUT requests, where N > 0, result in one item created.
Clicking around the application should not result in a 404 page or error page.
Overall I think there's lots of wider things you can check, because we should have UIs and tools that give simple rules and guarantees to users.
Most apps (at least in my part of the world) these days are absolutely peppered with side effects. At work our code is mostly just endpoints that trigger tons of side effects, then return some glob of data returned from some of those effects. The joys of micro services!!
If you’re designing from the ground up with testing in mind, you can make things super testable. Defer the actual execution of side effects. Group them together and move local biz logic to a pure function. But when you have a service that’s just a 10,000 line tangle of reading and writing to queues, databases and other services, it’s really hard to do ANY kind of testing.
I think that’s why unit testing and full on browser based E2E testing are popular these days. Unit testing pretends the complexity isn’t there via mocks, and lets you get high test coverage to pass your 85% coverage requirement. Then the E2E tests actually test user stories.
I’m really hoping there’s a shift. There are SO many interesting and comprehensive testing strategies available that can give you such high confidence in your program. But it mostly feels like an afterthought. My job has 90% coverage requirements, but not a single person writes useful tests. We have like 10,000 unit tests literally just mocking functions and then spying on the mocked return.
For anybody wanting to see a super duper interesting use of property based testing, check out “Breaking the Bank with test contract”, a talk by Allen Rohner. He pretty much uses property based testing to verify that mocks of services behave identically to the actual services (for the purpose of the program) so that you can develop and test against those mocks. I’ve started implementing a shitty version of this at work, and it’s wicked cool!!
The Python survey data (https://lp.jetbrains.com/python-developers-survey-2024/) holds pretty consistently at 4% of Python users saying they use it, which isn't as large as I'd like, but given that only 64% of people in the survey say they use testing at all isn't doing too badly, and I think certainly falsifies the claim that Python programs don't have properties you can test.
One drawback I see is that property-based tests inevitably need to be much more complex than example-based ones. This means that bugs are much more likely, they're more difficult to maintain, etc. You do mention that it's a lot of code, but I wonder if the complexity is worth it in the long run. I suppose that since testing these scenarios any other way would be even more tedious and error-prone, the answer is "yes". But it's something to keep in mind.
I don’t think that’s true, I just think the complexity is more explicit (in code) rather than implicit (in the process of coming up with examples). Example-based testing usually involves defining conditions and properties to be tested, then constructing sets of examples to test them which attempt to anticipate edge cases from the description of the requirements (black box) or from knowledge of how the code is implemented (white box).
Property-based testing involves defining the conditions and properties, writing code that generates the conditions, and for each property writing a bit of code that can refute it, by passing if and only if it is true of the subject under test for a particular set of inputs.
With a library like Hypothesis which both has good generators for basic types and good abstractions for combining and adapting generators, the latter seems to be less complex overall, as well as moving the complexity into a form where it is explicit and easy to maintain/adapt, whereas adapting example-based tests to requirements changes involves either throwing out examples and starting over or revalidating and updating examples individually.
You're downplaying the amount of code required to properly set up a property-based test. In the linked article, the author implemented a state machine to accurately model the SUT. While this is not the most complex of systems, it is far from trivial, and certainly not a "bit of code". In my many years of example-based unit/integration/E2E testing, I've never had to implement something like that. The author admits that the team was reluctant to adopt PBT partly because of the amount of code.
This isn't to say that example-based tests are simple. There can be a lot of setup, mocking, stubbing, and helper code to support the test, but this is usually a smell that something is not right. Whereas with PBT it seems inevitable in some situations.
But then again, I can see how such tests can be invaluable, very difficult and likely more complex to implement otherwise. So, as with many things, it's a tradeoff. I think PBT doesn't replace EBT, nor vice-versa, but they complement each other.
Since it came up in another thread (yes, it's trivial), a function `add` is no easier or harder to test with examples than with PBT, here are some of the tests as both PBT-style and example-based style:
    import pytest
    from hypothesis import given, strategies as st

    @given(st.integers())
    def test_left_identity_pbt(a):
        assert add(0, a) == a

    def test_left_identity():
        assert add(0, 10) == 10

    @given(st.integers(), st.integers())
    def test_commutative_pbt(a, b):
        assert add(a, b) == add(b, a)

    # `examples` is a hand-picked list of (a, b) pairs
    @pytest.mark.parametrize("a,b", examples)
    def test_commutative(a, b):
        assert add(a, b) == add(b, a)
They're the same test, but one is more comprehensive than the other. And you can use them together. Supposing you do find an error, you add it to your example-based tests to build out your regression test suite. This is how I try to get people into PBT in the first place: just take your existing example-based tests and build a generator. If they start failing, that means your examples weren't sufficiently comprehensive (not surprising). Because PBT systems like Hypothesis run so many tests, though, you may need to either restrict the number of generated examples for performance reasons or break up complex tests into a set of smaller, but faster running, tests to get the benefit.

Other things become much simpler, or at least simpler to test comprehensively, like stateful and end-to-end tests (assuming you have a way to programmatically control your system). Real-world, I used Hypothesis to drive an application by sending a series of commands/queries and seeing how it behaved. There are so many possible sequences that manually developing a useful set of end-to-end tests is non-trivial. However, with Hypothesis it just generated sequences of interactions for me and found errors in the system. After each command (which may or may not change the application state) it issued queries in the invariant checks and verified the results against the model. Like with example-based testing, these can be turned into hard-coded examples in your regression test suite.
Come on, that example is practically useless for comparing both approaches.
Take a look at the article linked above. The amount of non-trivial code required to set up a PBT should raise an eyebrow, at the very least.
It's quite possible that the value of such a test outweighs the complexity overhead, and that implementing all the test variations with EBT would be infeasible, but choosing one strategy over the other should be a conscious decision made by the team.
So as much as you're painting PBT in a positive light, I don't see it that clearly. I think that PBT covers certain scenarios better than EBT, while EBT can be sufficient for a wide variety of tests, and be simpler overall.
But again, I haven't actually written PBTs myself. I'm just going by the docs and articles mentioned here.
Come on, I admitted it was trivial. It was a quick example that fit into a comment block. Did you expect a dissertation?
> that implementing all the test variations with EBT would be infeasible
That's kind of the point to my previous comment. PBTs will generate many more examples than you would create by hand. If you have EBTs already, you're one step away from PBTs (the generators, I never said this was trivial just to preempt another annoying "Come on"). And then you'll have more comprehensive testing than you would have had sticking to just your carefully handcrafted examples. This isn't the end of property-based testing, but it's a really good start and the easiest way to bring it into an existing project because you can mostly reuse the existing tests.
Extending this, once you get used to it, to stateful testing (which many PBT libraries support, including Hypothesis) you can generate a lot of very useful end-to-end tests that would be even harder to come up with by hand. And again, if you have any example-based end-to-end tests or integration tests, you need to come up with generators and you can start converting them into property-based tests.
> but choosing one strategy over the other should be a conscious decision made by the team.
Ok. What prompted this? I never said otherwise. It's also not an either/or situation, which you seem to want to make it. As I wrote in that previous comment, you can use both and use the property-based tests to bolster the example-based tests, and convert counterexamples into more example-based tests for your regression suite.
> I haven't actually written PBTs myself.
Huh.
Probably because it's difficult to build simplistic intuition around them in languages that don't have robust built-in mechanisms that enable that kind of thinking.
For example, gen-testing in Clojure feels simple and intuitive, but if I have to do something similar, hmmm... I dunno, say in Java, I wouldn't even know where to start - the code inevitably would be too verbose; mutable state would obscure invariants; and the required boilerplate alone would cloud the core idea. While in Clojure, immutability-by-default feels transformative for gen-testing; data generation is first-class - generators are just values or functions; composition is natural; debugging is easier; invariants are verifiable; FP mindset already baked in - easier to reason about stateful sequences/event streams; there's far less ceremony overall;
But if I'm forced to write Java and see a good case for prop-based testing, I'd still probably try doing it. But only because I already acquired some intuition for it. That's why it is immensely rewarding to spend a year or more learning an FP language, even if you don't see it used at work or every day.
So, the honest answer to your question would probably be: "because FP languages are not more popular"
Property, fuzzy, snapshot testing. Great tools that make software more correct and reliable.
The challenge for most developers is that they need to change how they design code and think about testing.
I’ve always said the hardest part of programming isn’t learning, it’s unlearning what you already know…
I did some property-based testing in Haskell and in some cases the implementation was the specification verbatim. So what properties should I test? It was clearer where my function should be symmetric in the arguments or that there is a neutral element, etc.
If the property is basically your specification which (as the language is very expressive) is your implementation then you're just going in circles.
I find that most tutorials talk about "properties of function `foo`", whereas I prefer to think about "how is function `foo` related to other functions". Those relationships can be expressed as code, by plugging outputs of one function into arguments of another, or by sequencing calls in a particular order, etc. and ultimately making assertions. However, there will usually be gaps; filling in those gaps is what a property's inputs are for.
Another good source of properties is trying to think of ways to change an expression/block which are irrelevant. For example, when we perform a deletion, any edits made beforehand should be irrelevant; boom, that's a property. If something would filter out negative values, then it's a property that sprinkling negative values all over the place has no effect. And so on.
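A minimal sketch of that "sprinkling negatives has no effect" idea, with an invented `keep_positive` standing in for the real filtering code:

    from hypothesis import given, strategies as st

    def keep_positive(xs):
        # stand-in for whatever code is supposed to discard negative values
        return [x for x in xs if x >= 0]

    @given(st.lists(st.integers(min_value=0)),
           st.lists(st.integers(max_value=-1)),
           st.integers(min_value=0))
    def test_negatives_are_irrelevant(xs, negatives, pos):
        # splice the negative values in at an arbitrary position
        i = pos % (len(xs) + 1)
        spliced = xs[:i] + negatives + xs[i:]
        assert keep_positive(spliced) == keep_positive(xs)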
Is this approach something that makes sense in Go as well? I see at least one package[1][2] for Go, but I wonder if the approach itself makes sense for the language.
[1]: https://github.com/flyingmutant/rapid
[2]: Oh, I just found out about testing/quick.
Testing/quick is adequate for small things and doesn't introduce new dependencies, but it's also frozen. Many years ago, the Go team decided that PBT is complex enough that it shouldn't be in stdlib.
Here's a small example of a classic PBT technique used in a very practical Go project: https://github.com/connectrpc/connect-go/blob/cb2e11fb88c9a6...
I feel much more software documentation could greatly benefit from this approach. Describing the "why" and design tradeoffs helps me grok a system far better than the typical quickstart or tutorials which show snippets but offer little understanding. That is, they rarely help me answer: is this going to solve my problem and how?
That is the structure they (and many others) are following :).
Python is one example that comes to mind. They do have explanations here: https://docs.python.org/3/howto/index.html. And, to be fair, they are generally excellent in my opinion! But they're far from front and center and there's much less overall content compared to the other Diataxis types I think (granted, I haven't rigorously checked).
    import { test, fc } from '@fast-check/vitest'
    import { expect } from 'vitest'

    // mySort is the (hypothetical) sort under test; compare it against the
    // built-in toSorted with a numeric comparator
    test.prop([fc.array(fc.double({ noNaN: true }))])('sort is correct', (lst) => {
      expect(mySort(lst)).toEqual(lst.toSorted((a, b) => a - b))
    })
https://www.npmjs.com/package/@fast-check/vitest?activeTab=r...

Was already familiar with and using Hypothesis in Python so went in search of something with similar nice ergonomics. Am happy with fast-check in that regard.
I haven't used it, or property-based testing, but I can see how it could be useful.
Proptest in Rust is mostly there but has many more issues with monadic bind than Hypothesis does (I wrote about this in https://sunshowers.io/posts/monads-through-pbt/).
If I remember right, it basically uses a binary 'tape' of random decisions. Shrinking is expressed as manipulations of that tape. Your generators (implicitly) define a projection from that tape to your desired types. Shrinking an early part of the tape leaves the later sub-generators to try and re-use the later parts of the tape.
That's not guaranteed to work. But it doesn't have to work reliably for every shrink operation the library tries! It's sufficient, if you merely have a good-enough-chance to recover enough of the previous structure to trigger the bug again.
How do you do that in general? I can't find any documentation on that.
Hypothesis does shrink the examples, though.
QuickCheck's shrinking is type based. There's lots of different ways to generate eg integers. Perhaps you want them in a specific range, or only prime numbers or only even numbers etc. To make QuickCheck's shrinker preserve these invariants, you'd have to make a typed wrapper for each of them, and explicitly write a new shrinking strategy. It's annoying and complicated.
Hypothesis does this automatically.
    data Rat = Rat Int Nat deriving (Eq, Show)

    genRat = do
      (num, den) <- arbitrary
      pure (Rat num (1 + den))
`genRat` is a QuickCheck generator. It cannot do shrinking, because that's a completely separate thing in QuickCheck. We can write a shrinker for `Rat`, but it will have nothing to do with our generator, e.g.
    shrinkRat (Rat num den) = do
      (num', den') <- shrink (num, den)
      pure (Rat num' den')
Sure, we can stick these in an `Arbitrary` instance, but they're still independent values. The generation process is essentially state-passing with a random number generator; it has nothing to do with the shrinking process, which is a form of search without backtracking.

    instance Arbitrary Rat where
      arbitrary = genRat
      shrink = shrinkRat
In particular, `genRat` satisfies the invariant that values will have non-zero denominator; whereas `shrinkRat` does not satisfy that invariant (since it shrinks the denominator as an ordinary `Nat`, which could give 0). In fact, we can't even think about QuickCheck's generators and shrinkers as different interpretations of the same syntax. For example, here's a shrinker that follows the syntax of `genRat` more closely:

    shrinkRat2 (Rat n d) = do
      (num, den) <- shrink (n, d)
      pure (Rat num (1 + den))
This does have the invariant that its outputs have non-zero denominators; however, it will get stuck in an infinite loop! That's because the incoming `d` will be non-zero, so when `shrink` tries to shrink `(n, d)`, one of the outputs it tries will be `(n, 0)`; that will lead to `Rat n 1`, which will also shrink to `Rat n 1`, and so on.

In contrast, in Hypothesis, Hedgehog, falsify, etc. a "generator" is just a parser from numbers to values; and shrinking is applied to those numbers, not to the output of a generator. Not only does this not require separate shrinkers, but it also guarantees that the generator's invariants hold for all of the shrunken values; since those shrunken values have also been outputted by the generator (when it was given smaller inputs).
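For comparison, here's a rough sketch of the same generator in Hypothesis: the pair is built inside the strategy, so the non-zero-denominator invariant holds for every shrunk value as well.

    from hypothesis import given, strategies as st

    # mirror of genRat: construct the value inside the strategy, so anything
    # Hypothesis generates or shrinks to still has a denominator >= 1
    rats = st.tuples(st.integers(), st.integers(min_value=0)).map(
        lambda nd: (nd[0], 1 + nd[1])
    )

    @given(rats)
    def test_denominator_is_never_zero(rat):
        num, den = rat
        assert den >= 1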
1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it. Is your function doing a+b? Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. Any logical error that might be slipped into your SUT implementation has a high risk of also slipping into your test and will therefore be hidden by the complexity, even though it would be obvious from just looking at a few well thought through examples.
2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. You as the developer know the implementation and have a chance of coming up with the edge cases. You know the dark corners of your code. Random tests are just playing the lottery, replacing thinking hard.
Not true. For example, if `f` is `+`, you can assert that f(x,y) == f(y,x). Or that f(x, 0) == x. Or that f(x, f(y, z)) == f(f(x, y), z).
Even a test as simple as "don't crash for any input" is actually extremely useful. This is fuzz testing, and it's standard practice for any safety-critical code, e.g. you can bet the JPEG parser on the device you're reading this on has been fuzz tested.
> You basically just give up and leave edge case finding to chance.
I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is because the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed. Doing this for a JPEG parser actually causes it to produce valid images, which you would never expect to happen by chance. See: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
> Such a fuzzing run would be normally completely pointless: there is essentially no chance that a "hello" could be ever turned into a valid JPEG by a traditional, format-agnostic fuzzer, since the probability that dozens of random tweaks would align just right is astronomically low.
> Luckily, afl-fuzz can leverage lightweight assembly-level instrumentation to its advantage - and within a millisecond or so, it notices that although setting the first byte to 0xff does not change the externally observable output, it triggers a slightly different internal code path in the tested app. Equipped with this information, it decides to use that test case as a seed for future fuzzing rounds:
The author of Hypothesis experimented with this feature once, but people usually want their unit tests to run really quickly, regardless of whether property based or example based. And the AFL style exploration of branch space typically takes quite a lot longer than what people have patience for in a unit test that runs eg on every update to every Pull Request.
Yup, a standard test suite just doesn't run for long enough for coverage guidance to be worthwhile by default.
That said, coverage-guided fuzzing can be a really valuable and effective form of testing (see eg https://hypofuzz.com/).
Hypothesis in particular does something neat where it tries to generate random inputs that are more likely to execute novel paths within the code under test. That’s not replicated in Rust, but it is super helpful for reaching more paths of your code, and that’s simply not something you can do manually if you have a lot of non-obvious boundary conditions.
But even for something like a+b, you have lots of properties you can test. All the group theory axioms (insofar as they are supposed to hold) for example. See https://news.ycombinator.com/item?id=45820009 for more.
No. That's one valid approach, especially if you have a simpler alternative implementation. But testing against an oracle is far from the only property you can check.
For your example: suppose you have implemented an add function for your fancy new data type (perhaps it's a crazy vector/tensor thing, whatever).
Here are some properties that you might want to check:
a + b == b + a
a + (b + c) = (a + b) + c
a + (-a) == 0
For all a and b and c, and assuming that these properties are actually supposed to hold in your domain, and that you have an additive inverse (-). Eg many of them don't hold for floating point numbers in general, so it's good to note that down explicitly.
Depending on your domain (eg https://en.wikipedia.org/wiki/Tropical_semiring), you might also have idempotence in your operation, so a + b + b = a + b is also a good one to check, where it applies.
You can also have an alternative implementation that only works for some classes of cases. Or sometimes it's easier to prepare a challenge than to solve it: e.g. you can randomly move around in a graph quite easily, and then check that the A* algorithm you are working on finds a route that's at most as long as the number of random steps you took.
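A sketch of that last idea; `shortest_path_length` here is just Manhattan distance on an open grid, standing in for the real path finder under test:

    from hypothesis import given, strategies as st

    MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

    def shortest_path_length(start, goal):
        # stand-in for the path finder under test; on an open grid the true
        # shortest path is the Manhattan distance
        return abs(start[0] - goal[0]) + abs(start[1] - goal[1])

    @given(st.lists(st.sampled_from("UDLR")))
    def test_path_finder_never_beats_the_random_walk(walk):
        x, y = 0, 0
        for step in walk:
            dx, dy = MOVES[step]
            x, y = x + dx, y + dy
        # the walk itself is a route from (0, 0) to (x, y), so any shortest
        # path must be at most that long
        assert shortest_path_length((0, 0), (x, y)) <= len(walk)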
> 2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Testing for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. [...]
You'd be surprised how often the generic heuristics for edge cases actually work and how often manual test writers forget that zero is also a number, and how often the lottery does a lot of the rest.
Having said this: Python's Hypothesis is a lot better at its heuristics for these edge cases than eg Haskell's QuickCheck.
> a + (b + c) = (a + b) + c
> a + (-a) == 0
Great! Now I have a stupid bug that always returns 0, so these all pass, and since I didn't think about this case (otherwise I'd not have written that stupid bug in the first place), I didn't add a property about a + b only being 0 if a == -b and boom, test is happy, and there is nothing that the framework can do about it.
Coming up with those properties is hard for real life code and my main gripe with formal methods based approaches too, like model checking or deductive proofs. They move the bugs from the (complicated) code to the list of properties, which ends up just as complicated and error prone, and is entirely un...tested.
Contrast that with an explicit dead-simple test. Test code doesn't have tests. It needs to be orders of magnitude simpler than the system it's testing. Its correctness must be obvious. Yes, it is really hard to write a good test. So hard that it should steer how you architect your system under test and how you write code. How can I test this to have confidence that it's correct? That must be a guiding principle from the first line of code. Just doing this as an afterthought by playing the lottery and trying to come up with smart properties after the fact is not going to get you the best outcome.
In general though, the advice is: don't be excessively dogmatic. If you can't devise a property to test, use traditional testing with examples. It's not that hard. It's the same way you deal with end-to-end tests or unit tests not being useful for certain things: use the appropriate test style and approach for your problem.
This sounds backwards to me. How could you write any tests, or indeed implement any functions, if you don't know any relationships between the arguments/return-value, or the state before/after, or how it relates to other functions, etc.?
For the addition example, let's say the implementation includes a line like `if (x == 0) return y`; would you seriously suggest that somebody writing that line doesn't know a property like `0 + a == a`? Would that only be "an afterthought" when "trying to come up with smart properties after the fact"? On the contrary, I would say that property came first, and steered how the code was written.
Incidentally, that property would also catch your "always returns 0" counterexample.
I also don't buy your distinction that "real life code" makes things much harder. For example, here's another simple property:
delete(key); assert lookup(key) == []
This is pretty similar to the `a + (-a) == 0` example, but it applies to basically any database; from in-memory assoc-lists and HashMaps, all the way to high-performance, massively-engineered CloudScale™ distributed systems. Again, I struggle to imagine anybody implementing a `delete` function who doesn't know this property; indeed, I would say that's what deletion means. It's backwards to characterise such things as "com[ing] up with smart properties after the fact".

It's much easier to proceed explicitly like you suggest: have some properties in mind, co-develop property tests and code.
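A sketch of that delete-then-lookup property against a toy in-memory store; the `Store` class here is invented, but the same property applies to any key/value API:

    from hypothesis import given, strategies as st

    class Store:
        # stand-in for whatever database or map you actually have
        def __init__(self):
            self._data = {}
        def put(self, key, value):
            self._data.setdefault(key, []).append(value)
        def delete(self, key):
            self._data.pop(key, None)
        def lookup(self, key):
            return self._data.get(key, [])

    @given(st.dictionaries(st.text(), st.lists(st.integers())), st.text())
    def test_delete_then_lookup_is_empty(contents, key):
        store = Store()
        for k, values in contents.items():
            for v in values:
                store.put(k, v)
        store.delete(key)
        assert store.lookup(key) == []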
Often when I try to solve a problem, I start with a few simple properties and let them guide my way to the solution. Like your deletion example. Or, I already know that the order of inputs shouldn't matter, or that adding more constraints to an optimiser shouldn't increase the maximum, etc.
And I can write down some of these properties before I have any clue about how to solve the problem in question.
---
However, writing properties even after the fact is a learnable skill. You just need to practice, similarly to any other skill.
Well, just add one example based test for eg 1 + 2 = 3, and you would catch this bug.
You can freely mix and match property based tests with example based tests.
> Coming up with those properties is hard for real life code and my main gripe with formal methods based approaches too, like model checking or deductice proofs.
Yes, property based testing is an acquired skill that has to be learned.
It's not so bad, once you've done it a few times, though.
It also changes how you design code in the first place. Just like even example-based testing changes how you design your code.
> They move the bugs from the (complicated) code to the list of properties, which ends up just as complicated and error prone, and is entirely un...tested.
You might be doing this wrong, if your list of properties is as complicated as the code. And it's not 'untested': you are exercising your properties against your code.
For a conceptually really simple example: suppose you have two different ways to solve a certain problem, and suppose neither is really simpler than the other. You still gain something from comparing the output of both approaches on random input: unless you made the mistake in both approaches, many bugs will show up as discrepancies that you can investigate. Hypothesis even includes 'shrinking' to helpfully provide you with a minimal counterexample that shows the divergence.
From a formal point of view, you can say that this property is 'untested'; but in practice you still gain something by comparing the two implementations.
Not really, no, it's right there in the name: you should be testing properties (you can call them "invariants" if you want to sound fancy).
In the example of testing an addition operator, you could test:
1. f(x,y) >= max(x,y) if x and y are non-negative
2. f(x,y) is even iff x and y have the same parity
3. f(x, y) = 0 iff x=-y
etc. etc.
The great thing is that these tests are very easy and fast to write, precisely because you don't have to re-model the entire domain. (Although it's also a great tool if you have 2 implementations, or are trying to match a reference implementation)
I think you're manifesting some misconceptions and ignorance about property-based testing.
Property-based testing is still automated testing. You still have a sut and you still exercise it to verify and validate invariants. This does not change.
The core trait of property-based testing is that instead of having to define and maintain hard coded test data to drive your tests, which are specific realizations of the input state, property-based testing instead focuses on generating sequences of randomly-generated input data, and in the event of a test failing it follows up with employing reduction strategies to distil input values that pinpoint minimum reproducible examples.
As a consequence, tests don't focus on which specific value a sut returns when given a specific input value. Instead, they focus on verifying more general properties of a sut.
Perhaps the main advantage of property-based testing is that developers don't need to maintain test data anymore, and tests are no longer green just because you forgot to update the test data to cover a scenario or to reflect an edge case. Developers instead define test data generators, and the property-based testing framework implements the hard parts such as the input distillation step.
Property-based testing is no silver bullet though.
> Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of.
Your comment completely misses the point of property-based testing. You still need to exercise your sut to cover scenarios. Where property-based testing excels is that you no longer have to maintain curated sets of test data, or update them whenever you update a component. Your inputs are already randomly generated following the strategy you specified.
> It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it.
It does not, and it would be next to worthless if it did. It requires being able to define the properties required of the SUT and to write code that can refute them if they are not present (the name "hypothesis" for this library is, in fact, a reference to that; PBT treats the properties of code as a hypothesis, and attempts to refute it.)
> but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.
No, see my reimplementation of the sorting example without resorting to the builtin (or any) sorting function other than the one being tested:
https://news.ycombinator.com/item?id=45825482
> Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of.
You may think this based on zero experience, but I have seen no one who has tried hypothesis even once who has had that experience. It's actually very good at finding edge cases.
> You as the developer know the implementation and have a chance of coming up with the edge cases.
You as the developer have a very good chance of finding the same edge cases when writing tests for your own code that you considered when writing the code. You have much less chance of finding edge case when writing tests that you missed when writing code. You can incorporate the knowledge of probable edge cases you have when crafting Hypothesis tests just as with more traditional unit tests—but with traditional unit tests you have zero chance of finding the edge cases you didn't think of. Hypothesis is, actually, quite good at that.
> Random tests are just playing the lottery, replacing thinking hard.
Property-based testing doesn’t replace the thinking that goes into traditional unit testing, it just acts as a force multiplier for it.
I like sorting as an example, and I like that using the built-in is concise; reimplementing the behavior of an existing function, where using the existing function as an oracle is a reasonable thing to do for a test, isn't all that uncommon.
I feel like something with a couple of properties described in comments with assertions testing those properties (but where the functionality and properties are familiar enough that it would make a clear connection) would be a bit better, in theory, but I don't have a great particular example to use, and anything done that way will be, at best, somewhat less concise.