"Parse, don't validate" through the years with C++(derekrodriguez.dev)

86 points by dwrodri 6 days ago | 14 comments

foobar17263 days ago
It seems like the C++98 example is the best by far? Keeps all error information while remaining concise and easy to understand. Not to mention 50 times faster. (Could be improved by adding some simple type aliases like BirthYear that explicitly start from 1900.)
IMO the main takeaway is that malformed input is not an exceptional state when parsing, and should be treated as a first class citizen. Everything else is yak shaving how you want to handle the (status, validObject) tuple coming from the parser.
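A minimal sketch of that (status, validObject) shape, in a plain C++98-flavored style. It is not the article's code: `ParseStatus`, `Birthdate`, and `make_birthdate` are made-up names, and the 1900 floor just mirrors the toy range discussed in this thread.

```cpp
#include <cassert>
#include <utility>

// Hypothetical names for illustration; not taken from the article.
enum ParseStatus { Ok, BadYear, BadMonth, BadDay };

struct Birthdate {
    unsigned year, month, day;
};

// Returns a (status, value) pair. The value is meaningful only when
// status == Ok; otherwise the status says which field was rejected.
std::pair<ParseStatus, Birthdate> make_birthdate(unsigned y, unsigned m, unsigned d) {
    Birthdate b = {y, m, d};
    if (y < 1900) return std::make_pair(BadYear, b);
    if (m < 1 || m > 12) return std::make_pair(BadMonth, b);
    if (d < 1 || d > 31) return std::make_pair(BadDay, b);  // per-month limits omitted
    return std::make_pair(Ok, b);
}

// Usage: malformed input is an ordinary return value, not an exception.
void birthdate_status_examples() {
    assert(make_birthdate(1985, 7, 14).first == Ok);
    assert(make_birthdate(1985, 13, 14).first == BadMonth);
}
```

How the caller consumes the pair (branching, logging, converting to an exception) is the part that varies between the article's versions.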
- philip-b3 days ago
  The compile time is 50 times faster, not the runtime.
_alphageek3 days ago
The C++11 example is the weakest in the article by its own thesis. Public throwing constructor, no year check, no leap-year check, so Birthdate(0, 2, 30) constructs cleanly. The C++17/23 shape (private ctor + static factory) is the actual mechanical insight from King's essay. Make the constructor a function that can fail, so the type itself carries the proof.
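A rough C++17 sketch of the private constructor plus static factory shape, with the year and leap-year checks added. The names (`create`, `days_in_month`) and the 1900 lower bound are assumptions for illustration, not the article's code.

```cpp
#include <cassert>
#include <optional>

// Sketch only: member names and the 1900 floor are assumptions.
class Birthdate {
public:
    // The only way to obtain a Birthdate. Impossible dates yield std::nullopt.
    static std::optional<Birthdate> create(int year, int month, int day) {
        if (year < 1900 || month < 1 || month > 12 || day < 1) return std::nullopt;
        if (day > days_in_month(year, month)) return std::nullopt;
        return Birthdate(year, month, day);
    }
    int year() const { return year_; }
    int month() const { return month_; }
    int day() const { return day_; }

private:
    Birthdate(int y, int m, int d) : year_(y), month_(m), day_(d) {}

    static bool is_leap(int y) { return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0; }
    static int days_in_month(int y, int m) {
        static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        return (m == 2 && is_leap(y)) ? 29 : days[m - 1];
    }

    int year_, month_, day_;
};

// Usage: the problematic input from the comment above no longer constructs.
void birthdate_factory_examples() {
    assert(!Birthdate::create(0, 2, 30).has_value());
    assert(Birthdate::create(1996, 2, 29).has_value());
}
```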
- simonask3 days ago
  Just to note, a throwing constructor is “just as good” as a static factory method, provided you want to use exceptions for validation errors. Which you shouldn’t, but from the perspective of testing types as proof, it’s just as good.
- noitpmeder3 days ago
  exactly, use std::expected as the return type, avoid exceptions, and make a failable factory constructor to build your type. Make invalid states unrepresentable!!!
  - dietr1ch3 days ago
    Aren't you time-travelling? std::expected is C++23 (so available starting from 2025-2027 xd)
    https://en.cppreference.com/cpp/utility/expected
    diath3 days ago
    It has been available since GCC 12.1 (May 2022), Clang 19.1 (Sep 2024), and Visual Studio 17.13 (2022~): https://godbolt.org/z/on1v6qdf3
    These days compiler developers implement accepted standard features pretty fast.
    noitpmeder3 days ago
    And tl::expected (a largely identical impl) has been available similarly as long!
gsliepen3 days ago
The C example could have implemented a lot of validation just by checking the return value of sscanf():
```
    if (sscanf(user_input, "%4u-%2u-%2u", &year, &month, &day) != 3) {
        // return an error
    }
```
This still does not catch trailing garbage, but you could check for that as well:
```
    if (sscanf(user_input, "%4u-%2u-%2u%c", &year, &month, &day, &dummy) != 3) {
        // return an error
    }
```
The result would be 4 if there was at least one trailing character. Too bad there is still no std::scan() companion to C++23's std::print().
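Wrapped up as a function, the return-value check might look like the sketch below. `Ymd` and `scan_date` are hypothetical names, and the helper only checks the shape of the input: a string such as "1990-99-99" would still be accepted.

```cpp
#include <cassert>
#include <cstdio>
#include <optional>

struct Ymd {
    unsigned year, month, day;
};

// Hypothetical helper: succeeds only when the whole input is "Y-M-D".
std::optional<Ymd> scan_date(const char* input) {
    Ymd out{};
    char trailing;
    const int converted = std::sscanf(input, "%4u-%2u-%2u%c",
                                      &out.year, &out.month, &out.day, &trailing);
    // 3: clean match. 4: at least one trailing character. Anything else: malformed.
    if (converted != 3) return std::nullopt;
    return out;
}

// Usage: trailing garbage is rejected along with short input.
void scan_date_examples() {
    assert(scan_date("1990-02-28").has_value());
    assert(!scan_date("1990-02-28x").has_value());
}
```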
- tialaramex3 days ago
  Although it feels intuitively as though a std::scan could make sense, it doesn't, at least not with the sort of API I've seen suggested.
  Consider a hypothetical Goose type, we can express any Goose usefully as output and, conveniently, some potential inputs could be read as a Goose successfully though most arbitrary strings cannot be understood as a Goose.
  Providing std::print for Goose is simple: we've got a variable (or maybe a constant) of type Goose, and we just emit the correct sequence of symbols. Writing all the boilerplate in C++23 is annoying, but it's mechanical: not tricky, just very boring (and so maybe C++26 makes that easier via reflection).
  But how could std::scan for Goose work? We need a Goose variable to potentially store the Goose if we read one, but how can we make a default Goose? No, each Goose is unique and there is no substitute, this can't work.
  The std::scan idea seems attractive for simple, almost untyped input (strings, integers, that sort of thing), but the whole point of "Parse, don't validate" is that you probably want to parse email addresses and ISBNs and ISO dates; you don't want a string, another string and a third string.
  Rust's FromStr trait is more appropriate. Given a type implements FromStr we can parse any string to (maybe) get an instance of that type, but we don't need an "empty" instance first because we're doing the construction when we call the function.
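  A rough C++17 analogue of that idea, as a sketch: construction happens inside the parsing call, so no placeholder Goose is ever needed. The `Goose` class, its `from_string` factory, and the "goose:" input format are all invented for illustration.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical Goose; the only rule here is a "goose:" prefix and a non-empty name.
class Goose {
public:
    Goose() = delete;  // there is no "default" Goose to scan into

    // Loosely analogous to Rust's FromStr: returns a Goose or nothing.
    static std::optional<Goose> from_string(std::string_view text) {
        constexpr std::string_view prefix = "goose:";
        if (text.substr(0, prefix.size()) != prefix) return std::nullopt;
        text.remove_prefix(prefix.size());
        if (text.empty()) return std::nullopt;
        return Goose(std::string(text));
    }
    const std::string& name() const { return name_; }

private:
    explicit Goose(std::string name) : name_(std::move(name)) {}
    std::string name_;
};

// Usage: most strings are not a Goose, and that is a normal outcome.
void goose_examples() {
    assert(Goose::from_string("goose:Gertrude")->name() == "Gertrude");
    assert(!Goose::from_string("duck:Donald").has_value());
}
```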
  - gsliepen3 days ago
    Rust's FromStr only deals with parsing a single object. However, ideally std::scan() would be an exact counterpart of std::print() and would be able to parse multiple objects. I totally agree that the C way of passing references to already existing variables is not great. Ideally you return a tuple of objects, but then it becomes very annoying to specify the types. Maybe something like this?
    auto [value, text, goose] = std::scan<int, std::string, Goose>(input, "{} {} {}");
    A halfway solution would be to have the hypothetical std::scan() take references to std::optional<>s or std::expected<>s:
    std::optional<int> value;
    std::optional<std::string> text;
    std::optional<Goose> goose;
    /* auto result = */ std::scan(input, "{} {} {}", value, text, goose);
    The latter would be type safe, close to how scanf() works, but less satisfying from a functional programming standpoint.
    Orthogonal to that, adding support for scanning a Goose would be just like how you add a formatter for it, and would be quite similar to a Rust trait. One could imagine having to define something like this:
    template<> struct std::scanner<Goose> {
        constexpr auto parse(std::format_parse_context& ctx) {…}
        auto scan(std::format_context& ctx) const -> std::optional<Goose> {…}
    };
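    For a feel of the tuple-returning call shape, here is a small C++17 sketch built on operator>>. `scan_values` is a hypothetical helper, not a standard facility, and it ignores format strings entirely. It also requires default-constructible types, which is exactly the limitation raised upthread for a type like Goose.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>
#include <tuple>

// Hypothetical helper, NOT std::scan: reads whitespace-separated values of
// the requested types, in order, using operator>>.
template <typename... Ts>
std::optional<std::tuple<Ts...>> scan_values(const std::string& input) {
    std::istringstream in(input);
    std::tuple<Ts...> values;  // requires default-constructible types
    std::apply([&in](auto&... v) { (in >> ... >> v); }, values);
    if (in.fail()) return std::nullopt;  // a field was missing or did not convert
    return values;
}

// Usage: the types are named once, at the call site.
void scan_values_examples() {
    const auto parsed = scan_values<int, std::string>("42 geese");
    assert(parsed.has_value());
    assert(std::get<0>(*parsed) == 42);
    assert(std::get<1>(*parsed) == "geese");
}
```

    Once the optional has been checked, structured bindings give the call site from the comment above: `auto [count, word] = *parsed;`.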
MarsIronPI3 days ago
Heh, I can especially tell the first code example is LLM-generated. Humans don't usually write comments like:
```
   // There are a few ways to let API callers bring their own 
   // memory, as they would in a no-malloc environment and this
   // stack-friendly c'tor is a stand-in for that. 
```
There's just something about this comment that doesn't feel right. I've seen these kinds of phrasings in LLM output before but I'm not sure exactly how to describe them.
mayoff3 days ago
The second sentence of your summary is fine, but I don’t like the first sentence:
> Use your language’s type system to parse unstructured inputs.
We don’t use the type system to parse. We use the type system to provide evidence (also called a proof or a witness) that parsing was successful, and we rely on the language’s access control facilities (public/private) and the soundness of its type system to prevent fabrication of false evidence.
usefulcat3 days ago
I don't see how this is in any way preferable to having an ordinary default constructor that does the same thing:
```
    // There are a few ways to let API callers bring their own 
    // memory, as they would in a no-malloc environment and this
    // stack-friendly c'tor is a stand-in for that. 
    static Birthdate epoch() { return Birthdate(1900, 1, 1); }
```
- plorkyeran3 days ago
  Some readers will expect Birthdate() to be equivalent to Birthdate(0, 0, 0), and naming it Birthdate::epoch() makes it clear that it is not that. I don't think it's worth it, but there is an upside.
bregma3 days ago
Author has used LLMs to generate Java code in C++. It detracts from his point.
- pjmlp3 days ago
  What Java code?
  Regardless of how they might have used LLMs, I tend to have an issue with this kind of complaint, given the C++ example code in the book Design Patterns: Elements of Reusable Object-Oriented Software, released in 1994, two years before Java was made public.
  Or the examples from "Using the Booch Method: A Rational Approach" or "Designing Object Oriented C++ Applications Using The Booch Method".
  Additionally, there are enough framework examples, starting with MacApp in 1989, Turbo Vision in 1990, OWL in 1991, MFC in 1992, and so on.
  Somehow a C++ style that was prevalent in the industry between 1990 and 1996, that I bet plenty of devs still have to maintain in 2026, has become "Java in C++".
  - bregma2 days ago
    > What Java code?
    A class with a passel of static member functions is Java code. It is not in any way idiomatic C++ code, which has had namespace-level ("free") functions since it was invented as C-with-classes many decades ago. Using classes that hold a whole lot of static member functions is strongly frowned on in the professional C++ community.
    pjmlp2 days ago
    Certainly not the professional C++ community that still uses frameworks born in the 1990s, predating Java, or game engines.
  - antonvs3 days ago
    > Somehow
    There's not much mystery about that - Java took that approach and ran with it, and now has much greater mindshare than C++.
    Also, the mid-90s were before most software developers working today were born, I suspect. They'd have to go find a graybeard and ask them to tell them tales of yore, to find out about any of this.
    pjmlp3 days ago
    We gladly tell bonfire tales. :)
- SuperV12343 days ago
  No, it doesn't.
jsymolon3 days ago
First thought: assuming that birth years start at 1900 is bad for a number of reasons, one of which is "process this list of authors and ..."
What about everyone born before 1900?
- alpinisme3 days ago
  It’s a contrived example. And I have to assume the author intended it to be contrived given that he also put an upper bound at 1999 in an article written in 2026 in an industry that skews young.
  But the pattern applies regardless of the validation logic.
- psychoslave3 days ago
  If we go in that direction, assuming that the birth year of everyone who has ever existed is even known is already a big assumption.
- Neywiny3 days ago
  Or what if they were born after 1999?
  It's just a toy example, not a production-ready birthday validation library.
blt3 days ago
I'm not a Haskell programmer, but from my limited awareness: Wouldn't they want to encode the restriction that April 31 doesn't exist directly in the type system instead of using raw integers for the underlying struct?
kstenerud3 days ago
C is perfectly capable of type-driven design. He's already got the type (struct), and although C is a bit limited, he can:
* return pointer-or-null
* choose "invalid" sentinel values and then use birthdate_is_valid(...) to check validity.
* Add an is_valid bool field (or even an error enum like in the C++23 example)
* Add an out field in the constructor function for the error code (similar to how ObjC does things).
- wk_end3 days ago
  The point of parse-don't-validate is that the type checker prevents you from having a value of a particular type that's invalid.
  Pointer-or-NULL doesn't work, because all pointers are nullable in C; you can always have a Foo* (NULL) that doesn't actually point to a valid Foo.
  Invalid sentinel values are definitionally values of a particular type that are invalid. Same with an is_valid field.
  An out field in the constructor means that whatever you actually return in the case of an error is going to be a well-typed Foo that's invalid.
  - kstenerud3 days ago
    My point is that you do the checking at the call site, and then use a static analysis tool or an AI to enforce checking the result right after calling parse_birthday.
    Sure, Optional is more elegant, but the end result is the same: Now none of the other code needs to validate; it's already been verified valid at all points where a parse error could have occurred.
    C may not be an easy language, but with the right tooling you can make code safer, and idioms like parse-dont-validate possible.
- mrkeen3 days ago
  Cool, incredibly low bar.
  All four of your examples are validate.
  Know any languages that are worse than C at this?
- tech_hutch3 days ago
  Or use an out field for the type itself, and use the return value for an error code (or just a bool). A common pattern in C#.
rienbdj3 days ago
C++ could use some do-notation
- marcosdumay3 days ago
  Abstracting any part of code structure in C++ is a wasps' nest that will attack you back.
  - lstodd3 days ago
    Did you mean "abstract you back"?
    Being abstracted by code you just wrote is quite a painful experience, yes.
actionfromafar3 days ago
Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming? Like parsing and validating blurs into each other.
- LittleLily3 days ago
  In my experience it makes even more sense in functional programming languages, not less, since they usually also have more powerful type systems that help with actually representing parsed vs unparsed data.
- gspr3 days ago
  > Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming?
  Parse, don't validate was written around Haskell!
  - actionfromafar3 days ago
    What I tried and apparently failed to express with "parsing and validating blurs into each other." was that parsing more easily becomes "just what you do" in functional style of programming. To the point that nowadays I can no longer really remember what I did back when I tried to "validate" things instead of parsing them.
- andrepd3 days ago
  The tl;dr is that instead of representing emails as type String and manually sprinkling is_email(str) throughout your code, you represent them as type Email, which has a function parse(String) -> Option<Email>. The type system then ensures the checks are present whenever they have to be, and nowhere else.
  This is extremely natural to do in a language like Haskell or Rust. And incredibly unnatural to do in C++ for instance.
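  For comparison, here is roughly what that Email shape looks like in C++17, as a sketch. The `Email` class, `parse`, and `domain_of` are illustrative names, and the '@' test is a placeholder rather than real address validation.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Illustrative only: the '@' check is a stand-in for real validation.
class Email {
public:
    static std::optional<Email> parse(std::string text) {
        const auto at = text.find('@');
        if (at == std::string::npos || at == 0 || at + 1 == text.size()) return std::nullopt;
        return Email(std::move(text));
    }
    const std::string& str() const { return text_; }

private:
    explicit Email(std::string text) : text_(std::move(text)) {}
    std::string text_;
};

// Callers cannot pass an unchecked string here, so no is_email() call is needed.
std::string domain_of(const Email& email) {
    return email.str().substr(email.str().find('@') + 1);
}

// Usage: the check happens once, at the boundary.
void email_examples() {
    assert(domain_of(*Email::parse("ada@example.com")) == "example.com");
    assert(!Email::parse("not-an-email").has_value());
}
```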
  - short_sells_poo3 days ago
    I hope this is not trolling so I'll bite. It is incredibly natural to represent an object, such as an email, as an Email class in object oriented languages like C++. It'd then have a constructor that accepts a string and constructs the email object from said string, or maybe a parse(string) -> Option<Email> thingy. The type system then ensures the checks are present whenever they have to be, and nowhere else.
    Tl;dr: there's nothing extra that functional or OO programming give you here. Both allow you to represent the problem in a properly typed fashion. Why would you represent an email as a string unless you are a) deeply inexperienced or b) have some really good reason to drop all the benefits of a strongly typed language?
    bananaboy3 days ago
    I completely agree with you but I think sometimes folks carry some piece of data around as a string or int instead of something more concrete like a class or a strongly typed enum etc purely out of laziness!
    MarsIronPI3 days ago
    I think the old Lisp tradition of using lists for everything is related to this somehow. On the other hand, in Common Lisp programmers can define custom types that have to fulfill a predicate function. Then, if they declare the types of their functions, most implementations will generate type-checking code unless instructed not to. So in Common Lisp you can use lists for everything but still have type-checking, at some cost to efficiency. :D
    leodavi3 days ago
    Well, in C++ the constructor must return a value of its class type - you can't return an Option<T> from a constructor on T, for example, and since constructors are the canonical way to construct an object, it creates stylistic and idiomatic friction when you start using free functions to create a Maybe<T> instead of constructors.