No way to parse integers in C (2022) (blog.habets.se)

52 points by konmok 7 hours ago | 13 comments

orthoxerox 3 hours ago
I wasn't in this class myself, but one prof at my alma mater started his "Programming 201" class with the simplest assignment: write a C program that accepts two integers from the user and prints their sum. It actually was the only assignment for the rest of the semester, since he had a test suite that would humiliate the students gently at first, but would ultimately pipe a billion nines into stdin as the first argument.
- dlcarrier an hour ago
  It's a little awkward, because you'd need to parse the strings in reverse, but if all you need to do is sum, you can do it one digit at a time, while at any given moment handling only one character from each input string, a carry byte, and one output character.
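A minimal sketch of the digit-at-a-time addition described above, in C. It assumes both inputs are plain decimal digit strings with no sign, and the function name `add_decimal_strings` is hypothetical. Only one digit from each input and a carry are in play at any step, so the input length is limited by memory rather than by a machine word.

```c
#include <stdlib.h>
#include <string.h>

/* Adds two non-negative decimal strings (digits only, no sign).
 * Returns a malloc'd string the caller must free, or NULL on bad input
 * or allocation failure. */
char *add_decimal_strings(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la == 0 || lb == 0)
        return NULL;
    if (strspn(a, "0123456789") != la || strspn(b, "0123456789") != lb)
        return NULL;

    size_t len = (la > lb ? la : lb) + 1;   /* room for a final carry */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    out[len] = '\0';

    int carry = 0;
    for (size_t i = 0; i < len; i++) {
        /* Walk from the least significant digit of each input. */
        int da = i < la ? a[la - 1 - i] - '0' : 0;
        int db = i < lb ? b[lb - 1 - i] - '0' : 0;
        int sum = da + db + carry;
        out[len - 1 - i] = (char)('0' + sum % 10);
        carry = sum / 10;
    }

    /* Drop the leading zero if the top carry was not needed. */
    if (out[0] == '0' && len > 1)
        memmove(out, out + 1, len);   /* len bytes includes the terminator */
    return out;
}
```

For example, `add_decimal_strings("999", "1")` returns a newly allocated `"1000"`, which the caller frees.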
- jeffrallen 2 hours ago
  Would be fun to write a program that arranges to send the input into dc(1) and just outsource the whole problem to Ken or Rob or whoever wrote it. :)
- msie 2 hours ago
  Perfect is the enemy of good.
  - pjc50 2 hours ago
    Once a program is available over the internet, hackers are the enemy of merely good programs that don't perfectly validate their input.
    "You have to get lucky every time. We only have to get lucky once".
    msie 13 minutes ago
    Sure
  - chowells an hour ago
    But in this case, C is not "good". It is more like "abysmal". "Good" is just producing a correct result or error, with no ambiguity which case applied and no UB. "Perfect" is arguing over the most usable and elegant API for it.
- clark_dent 2 hours ago
  Could you humor a coding noob--how do you deal with utterly insane inputs like that?
  - wwalexander 16 minutes ago
    Arbitrary precision arithmetic (GMP, BigInteger, etc). Numbers can take arbitrary amounts of memory, instead of just a single machine word.
  - doubled112 an hour ago
    Crash and report an error.
    chowells an hour ago
    You report an error and exit cleanly with a proper operating system error code. Crashing is a quick hack, acceptable for throwaway projects but not in software used long-term.
  - matthewkayin an hour ago
    You first ask if you really need to.
    AnimalMuppet 30 minutes ago
    Unless you're exposing it to the internet, ever, in the entire future history of the program. Then you kind of have to, in one form or another.
bsenftner 5 hours ago
One of the first homework assignments when I learned C back in '83 was after a long lecture on how the string functions are fundamentally broken, and the class introduction to writing C was fixing all of them.
- psvv 3 hours ago
  My memory growing up is that making your own C library was basically an inevitable rite of passage for any aspiring programmer.
  - prerok 18 minutes ago
    Yeah, it's a shame we never got something like boost for C. Every company I ever worked for had its own common C library solving these problems.
alexfoo 4 hours ago
I remember an old project that ran into something like this. I think we just used atoi() or similar and the error check was a string comparison between the original input and a sprintf() of the converted value.
Ugly (and not performant if in a hot path) but it works.
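The round-trip check might look roughly like the following. This is a sketch, not the original project's code: it swaps `atoi()` for `strtol()` (because `atoi()` has undefined behavior when the value is out of range) and `sprintf()` for `snprintf()`, and the name `parse_long_roundtrip` is made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Round-trip check: convert, print the result back, compare with the input.
 * Only the canonical spelling survives, so "42" passes while " 42", "+42",
 * "042", "42abc" and out-of-range values are all rejected. */
bool parse_long_roundtrip(const char *s, long *out)
{
    char buf[32];                      /* ample for a 64-bit long plus sign */
    long v = strtol(s, NULL, 10);      /* saturates at LONG_MIN/LONG_MAX */
    int n = snprintf(buf, sizeof buf, "%ld", v);
    if (n < 0 || (size_t)n >= sizeof buf)
        return false;
    if (strcmp(buf, s) != 0)
        return false;
    *out = v;
    return true;
}
```

The strictness is a side effect of the comparison: anything that does not print back identically is treated as an error, which is usually what input validation wants.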
zokier 5 hours ago
I thought it was pretty well known that everything related to strings in C stdlib (including all str... functions) is bad. You just need to bring in your own string library.
- bhk 2 hours ago
  Not just the string-related functions. If you want robust error checking, re-entrant code, and bounds checking performed in library functions (instead of performing bespoke validations all across your code base), you have some work to do. Yes, some improvements have been tacked on over the years, but many problems ("current locale", for one) remain endemic.
  In my experience, the worst part of the C standard library is not its existence, but the fact that so many developers insist on slavishly using it directly, instead of safer wrappers.
ramon156 4 hours ago
Why not look at how other languages attack this? e.g. how does "42".parse() work in rust?
Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537
interesting! It boils down to this
```
pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
    use self::IntErrorKind::*;
    use self::ParseIntError as PIE;

    // guard: radix must be 2..=36
    if 2 > radix || radix > 36 {
        from_ascii_radix_panic(radix);
    }

    if src.is_empty() {
        return Err(PIE { kind: Empty });
    }

    // Strip leading '+' or '-', detect sign
    // (a bare '+' or '-' with nothing after it is an error)
    // accumulate digits, checking for overflow
    Ok(result)
}
```
- marcosdumay 2 hours ago
  It's not an overwhelmingly hard problem. There are some issues with radix signaling, exponent notation, decimal points being allowed or not, and group separators that make parsing numbers incredibly irritating. So you usually don't want to do it yourself.
  But it's not hard at all. It's not even as full of small issues that you can't handle the load, like dates. It's just annoying as hell.
  The problem is exclusive to C and C++. It's created by the several rounds of standardization of broken behavior.
voidUpdate 5 hours ago
Can't you just:
```
  for(int i = 0; i < len(characters); i++)
  {
    if(characters[i]-48 <= 9 && characters[i]-48 >= 0)
    {
      ret = ret * 10 + characters[i] - 48;
    }
    else
    {
      return ERROR;
    }
  }
  return ret;
```
Adjust until it actually works, but you get the picture.
- knome 4 hours ago
  this wouldn't catch overflow or underflow errors, nor does it allow non-base-10 numbers, nor does it handle negative numbers. and writing your own parser is a failure case by op's logic. they are complaining about the builtin parsing functions.
  the author admits you can parse signed integers in their second example, but for unsigned, they don't seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.
  I'm not sure what they mean by "output raw" vs "output"
      $ cat t.c
      #include <stdlib.h>
      #include <math.h>
      #include <stdio.h>
      int main(int argc, char ** argv){
        char * enda = NULL;
        unsigned long long a = strtoull("-18446744073709551614", &enda, 10);
        printf("in = -18446744073709551614, out = %llu\n", a);
        char * endb = NULL;
        unsigned long long b = strtoull("-18446744073709551615", &endb, 10);
        printf("in = -18446744073709551615, out = %llu\n", b);
        return 0;
      }
      $ gcc t.c
      $ ./a.out
      in = -18446744073709551614, out = 2
      in = -18446744073709551615, out = 1
      $
  I get their "output raw" value. I don't know what their "output" value is coming from.
  I don't see anywhere they describe what they are representing in the raw vs not columns.
  - thomashabets2 2 hours ago
    > they don't seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.
    That's right. I don't like asking it to parse the number contained inside a string, and getting a different number as a result.
    That's just simply not the right answer.
    > I'm not sure what they mean by "output raw" vs "output"
    I can see how that's very unclear. Changed now to "Readable".
  - card_zero 2 hours ago
    I think "output" is just supposed to be a human-readable version of "output raw". So the line in the table where "output raw" is 2 but "output" is 1 looks like a mistake. It's repeated in the table for sscanf().
    thomashabets2 2 hours ago
    Yup. Sorry about that.
- Sharlin 4 hours ago
  And how does this avoid returning nonsense if the number is too large? (Wrapping if the accumulator is unsigned, straight to UB land if signed.) Not reporting overflows as errors is one of the major problems demonstrated by TFA.
  - voidUpdate 4 hours ago
    you could check if ret > ret * 10 + characters[i]-48, if so it has wrapped around and you return an error
    thomashabets2 2 hours ago
    For unsigned that could work, but signed overflow is UB.
    Thiez 3 hours ago
    [dead]
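One way to make the overflow check in this subthread robust is to test before doing the arithmetic, as in the sketch below (the name `parse_uint_checked` is hypothetical). Comparing after the fact can miss a wrap: assuming a 32-bit `unsigned int`, 536870912 * 10 wraps to 1073741824, which is still larger than the old value. Because the test runs before the multiplication, the same shape also works for signed types with `INT_MAX` in place of `UINT_MAX`, without ever triggering UB.

```c
#include <limits.h>
#include <stdbool.h>

/* Parses a non-empty string of decimal digits into an unsigned int.
 * Returns false on empty input, a non-digit, or overflow. */
bool parse_uint_checked(const char *s, unsigned int *out)
{
    unsigned int ret = 0;
    if (*s == '\0')
        return false;
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return false;
        unsigned int digit = (unsigned int)(*s - '0');
        /* Test before computing: would ret * 10 + digit exceed UINT_MAX? */
        if (ret > (UINT_MAX - digit) / 10)
            return false;
        ret = ret * 10 + digit;
    }
    *out = ret;
    return true;
}
```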
- fhdkweig 3 hours ago
  What if the number you want to return just happens to be the value of ERROR? You need an error flag that can't be represented as an int, but then C wouldn't let you return it from a function that only returns "int". It is why some languages throw exceptions and why databases have the special "null" value.
  - voidUpdate 3 hours ago
    I don't use C enough to know what the convention is for throwing an error when the function can return a number anyway. You'd have to ask someone else
    zbentley 2 hours ago
    In C, errors are usually indicated by a negative return value constant, crashing the program with abort, or setting the errno global (thread-local, but whatever) and expecting callers to check it. Sometimes multiple of those.
    QuercusMax an hour ago
    One reasonably common pattern is to have the return value indicate success / error, and you pass in a pointer to the value which will be mutated if successful.
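A sketch of that status-plus-out-pointer pattern wrapped around `strtol()`. The enum and function names are hypothetical. Because the status travels in the return value and the number in `*out`, no integer value has to be sacrificed as an error sentinel.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical status codes for illustration. */
enum parse_status { PARSE_OK = 0, PARSE_INVALID, PARSE_RANGE };

/* Strict base-10 parse of a long. *out is written only on PARSE_OK. */
enum parse_status parse_long(const char *s, long *out)
{
    char *end;

    /* strtol would skip leading whitespace; treat it as an error instead. */
    if (isspace((unsigned char)s[0]))
        return PARSE_INVALID;

    errno = 0;                         /* clear any stale error first */
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')      /* no digits, or trailing junk */
        return PARSE_INVALID;
    if (errno == ERANGE)               /* value did not fit in a long */
        return PARSE_RANGE;
    *out = v;
    return PARSE_OK;
}
```

A caller would write something like `if (parse_long(arg, &n) != PARSE_OK) { /* report and bail */ }`. Note that this sketch still accepts a leading `+`, which `strtol()` permits.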
  - jerf 3 hours ago
    And why some very, very special languages have an effectively-global variable called "errno" that you have to check after the call manually, and worry about whether maybe it was populated from some previous error. Nothing says "production-quality language that an entire civilization's code base should be based on" like "sometimes (but only sometimes!) functions return additional information through global values".
    aleph_minus_one 3 hours ago
    > And why some very, very special languages have an effectively-global variable called "errno" that you have to check after the call manually, and worry about whether maybe it was populated from some previous error.
    As you can read at https://en.wikipedia.org/wiki/Errno.h, errno is barely used by the C standard (though defined there). It is rather POSIX that uses errno pervasively. For example the WinAPI functions use a much more sensible way to report errors (and don't make use of errno).
- bitwize 4 hours ago
  You cannot "just" anything in C without hitting a minefield of UB. It is, probably, more economical to convert your entire project to Rust than it is to do the pufferfish spine removal procedure of auditing the code base for UB and replacing the problem areas. With generative AI, the size of project for which this remains true may be as large as "the entire Linux kernel".
contubernio 2 hours ago
One of the great virtues of C is that this sort of thing is not part of the language ...
- thomashabets2 an hour ago
  Only literally. 7.24.1 in the C programming language spec has these poor parsers.
  - rbanffy 42 minutes ago
    Is their misbehavior part of the spec as well? If not, we can always add the correct behavior to the spec and let anyone who implemented a broken version deal with fixing every program compiled using it.
    thomashabets2 18 minutes ago
    Fair enough.
    For strtoul and friends, maybe? 7.24.1 is pretty dense, but the key parts are "the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign […] If the correct value is outside the range of representable values […] ULONG_MAX […] is returned".
    So the "expected form" allows a minus sign, but then it's clearly "outside the range of representable values" for strtoul to try parsing a negative value. So maybe it should return ULONG_MAX on those.
    So arguably a minus sign present could already be treated as an error, and still be standard compliant. Unless I'm misreading.
jervant 4 hours ago
https://man.openbsd.org/strtonum
- bmandale 3 hours ago
  Interestingly fails as well, in two ways. First:
  > The string may begin with an arbitrary amount of whitespace (as determined by isspace(3))
  Second is that it only applies to signed long long, not unsigned.
eithed 4 hours ago
Can't you use a regex to check that the given string contains just digits and then use any of the provided methods? Then check if the returned value is a number to cater for edge cases.
Ok, having a method to do that for you would be nice, but the post reads like it's an issue that std library doesn't provide you with a method behaving as you exactly want
CodesInChaos 3 hours ago
Another case many integer parsing functions get wrong is that they interpret a leading 0 as an octal indicator.
That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.
- kevin_thibedeau 2 hours ago
  It used to be much more common. In the 70s there was a lot of collective hesitance to use hex with its strange letter digits. Octal was the compact representation of choice.
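In C's `strtol()` family the octal interpretation is already opt-in, through the base argument. The two small wrappers below (hypothetical names) show the difference between base 0, which infers the radix from the prefix, and an explicit base 10.

```c
#include <stdlib.h>

/* Base 0 asks strtol to infer the radix from the prefix:
 * "0x" selects hexadecimal and a bare leading "0" selects octal. */
long parse_auto_base(const char *s)
{
    return strtol(s, NULL, 0);
}

/* Base 10 treats leading zeros as plain padding. */
long parse_base10(const char *s)
{
    return strtol(s, NULL, 10);
}
```

With these, `parse_auto_base("010")` yields 8 while `parse_base10("010")` yields 10.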
chadgpt3 3 hours ago
... say users of the only language with no way to parse integers.
:)
stephc_int13 4 hours ago
As a C programmer, I find this kind of bad faith article very irritating.
Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.
String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.
C is not the C standard library, ffs.
- konmok 4 hours ago
  I don't think it's in bad faith.
  The distinction between a language and its standard library gets blurry even in theory, and in practice they're nearly inseparable. If a language's standard library has four ways of doing almost the same thing, and they're all fundamentally broken, that's a problem.
  - stephc_int13 3 hours ago
    If you read the other articles by the same author on his blog, you'll see that he has some strong and weird opinions about C and UB.
    Complete BS in my opinion.
  - dosisking 4 hours ago
    [flagged]
- alexfoo 4 hours ago
  Exactly. A wrapper that handles all of the edge cases properly and gives proper reporting just gets added to your own library of functions and the devs get used to using it. Much like the code for abstract data types like lists/hashmaps/etc which neither C nor the standard libraries provide.
  Bonus points for having bespoke linting rules to point out the use of known “bad” functions.
  In one old project we went through and replaced all instances of sprintf() with snprintf() or equivalent. Once we were happy that we’d got every occurrence we could then add lint rules to flag up any new use of sprintf() so that devs didn’t introduce new possible problems into the code.
  (Obviously you can still introduce plenty of problems with snprintf() but we learned to give that more scrutiny.)
  - thomashabets2 2 hours ago
    While snprintf() is better than sprintf(), I find that it's easy for people to not check if the return value is bigger than the provided size. Sure, it prevents a buffer overflow, but there could still be a string truncation problem.
    Similar to how strlcpy() is not a slam dunk fix to the strcpy() problem.
    alexfoo an hour ago
    That's partly the point.
    If someone uses sprintf() you have to go faffing around to check whether they've thought about the destination buffer size. The size of the structure may be buried far away through several layers of other APIs/etc.
    Using snprintf() doesn't solve this in any way, but checking whether the new use of snprintf() checks the return value is relatively simple. Again, there's still no guarantee that there aren't other problems with snprintf(), but in our experience, once people were forced to use it over sprintf() and had things checked in PR reviews, the number of instances of misuse dropped dramatically.
    It wasn't the switch of functions that reduced the number of problems we saw, but the outright banning of the known footgun `sprintf()` and the careful auditing and replacement of it with `snprintf()` that served as a whole load of reference copies for how to use it. We spread the work of replacing `sprintf()` around the team so that everyone got to do some of the switches and everyone got to review the changes. And we found a whole load of possible problems (most of which were very unlikely to ever lead to a crash or corruption.)
    The same would apply if you picked any other known footgun and did similar refactoring/rewrites/auditing/etc.
    Anyway, I haven't done C commercially/professionally for about 5 years now. I do miss it though.
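The return-value check on `snprintf()` discussed in this subthread is short enough to show. This is a sketch with a hypothetical helper name and format string, not code from the project described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Formats "name=value" into dst. Returns false if the output was truncated
 * or an encoding error occurred, so the caller never silently uses a
 * shortened string. */
bool format_pair(char *dst, size_t dst_size, const char *name, long value)
{
    int n = snprintf(dst, dst_size, "%s=%ld", name, value);
    /* n is the length the full output needs, excluding the terminator. */
    return n >= 0 && (size_t)n < dst_size;
}
```

A reviewer then only has to confirm that every call site looks at the result, which is the property that made the audit tractable.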
  - 1718627440 4 hours ago
    > like lists/hashmaps/etc which neither C nor the standard libraries provide
    There is a hashmap implementation though: https://man7.org/linux/man-pages/man3/hsearch.3.html
    steveklabnik 2 hours ago
    “One hashmap for your entire program” is not generally what people mean when they want a hashmap.
    alexfoo an hour ago
    > The three functions hcreate_r(), hsearch_r(), hdestroy_r() are reentrant versions that allow a program to use more than one hash search table at the same time.
    alexfoo 3 hours ago
    Sure there's an implementation, but like the integer parsing functions that sparked this thread there are some severe limitations with the implementation.
    (In fact, looking at it again, I assume I'd purposely purged it from my memory given how terrible it is.)
    The non-extensible nature is the biggest one. There are plenty of times when the maximum number of elements needed to be stored will not be known in advance. (See the note about hcreate().)
    Secondly the hsearch() implementation requires the keys to be NUL terminated strings since "the same key" is determined using strcmp(). Good luck if you want to use a number, pointer, arbitrary structure or anything else as a key.
    Any reasonable hash table implementation would not have either of these limitations.
    Maybe I needed to say:
    > > like lists/hashmaps/etc which neither C nor the standard libraries provide
    ... reasonable implementations of.
- wang_li 4 hours ago
  The thing I find irritating is all the folks who say C is broken because it’s not a write once run anywhere language like JavaScript or python. Part of the deal has always been that the programmer needs to understand the target platform and the target compiler’s behavior.
  - DowsingSpoon an hour ago
    Write once run anywhere? But C already is a "write once run anywhere" language! Though, you usually have to recompile first :)
    The criticisms related to UB are not about understanding the target platform and the target compiler's behavior. Undefined Behavior is not the same thing as Implementation-defined Behavior, and lots of folks (including me) would be satisfied with reclassifying chunks of UB as the latter.
    The behavior of the target platform isn't really the issue. C23 mandates two's complement for signed integers. Most hardware wraps on overflow, but that literally doesn't matter. The standard says a program exhibiting signed overflow is undefined, period.
    In practice, UB rules mean the compiler is free to remove checks for signed overflow/underflow, checks for null pointers, etc. This can and does happen. Man, just a few weeks ago, I had to deal with a crash in a C program that turned out to be due to the compiler removing a null check. That was a painful one.
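To make the "removed check" problem concrete for signed overflow: a test written as `a + b < a` performs the overflowing addition first, so the compiler may assume it never happens and drop the test. A well-defined alternative compares against the limits instead. This is a sketch, and the function name is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Reports whether a + b would overflow an int, without ever computing
 * a + b. Every subtraction below stays within range, so there is no
 * undefined behavior for the optimizer to exploit. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* a + b would exceed INT_MAX */
    if (b < 0)
        return a < INT_MIN - b;   /* a + b would fall below INT_MIN */
    return false;                 /* adding zero never overflows */
}
```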
  - thomashabets2 an hour ago
    The point of this post, though, is even something as simple as "give me this string as an integer" doesn't have an answer that doesn't come with "are you OK with this best effort parse under these edge cases? Oh and we use this number as error, so you can't parse that".
    Like… edge cases? It's parsing a number! We're not talking about I/O on hard vs soft intr NFS mounts, here. There's a right answer.
    strlen(), on valid null terminated strings, doesn't come with caveats like "oh we can't measure strings of length 99".
    But sure, C is Turing complete. It is possible to solve any problem a Turing machine can solve.
    > understand the target platform and the target compiler’s behavior.
    This is neither. This is purely the language.
  - mswphd 2 hours ago
    isn't the whole point of C that it's portable assembly though? needing to understand the target platform/compiler's behavior to write correct code seems to cut against that claim quite a bit.
    wang_li an hour ago
    No. What gives that idea? The language doesn't even fix the data size of its primary numerical type. No way anyone thought that was portable.
    konmok an hour ago
    Is this sarcasm? I thought C didn't fix the size of int because they were trying to make C programs "portable" between architectures with different natural word sizes. It was a mistake, but I remember that as being the stated reason. I'm happy to be corrected if I'm misremembering my history though.
- msie 2 hours ago
  The people downvoting you are probably not C programmers and love to hate C.
  - card_zero 18 minutes ago
    I guess trying to write in Rust makes them irritable.