Here's a way weirder example:
volatile int x = 5;
printf("%d in hex is 0x%x.\n", x, x);
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.So in common parlance, a "data race" is any concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!
That said, your “common parlance” definition of “data race” is not the definition used by the C standard, so your last sentence is at best misleading in a discussion of standard C.
> The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
(Here “conflicting” and “happens before” are defined in the preceding text.)
Lots of people mistakenly think that C and C++ are "really flexible" because they let you do "what you want". The truth of the matter is that almost every fancy, powerful thing you think you can do is an absolute minefield of UB.
int increment(int x) {
return x + 1;
}
Which is UB for certain values of x.If you want to be standards correct, yes you have to know the standard well. True. And you can always slip, and learn another gotcha. Also true. But it's still extremely flexible.
Maybe this already exists, even? A stripped down version of C? A more advanced LLVM IR? I feel like this is a problem that could use a resolution, just maybe not with enough of a scale for anyone to bother, vs. learning C, assembly of given architecture, or one of the new and fancy compiled languages.
Rust is a somewhat more thorough attempt to actually course-correct.
Edit: thread=thread of execution. I’m not making a point about thread safety within a program.
I'm also not convinced (yet) that the example really is UB: I agree reading a volatile is "a side effect" in some sense, and GP cited a paragraph that says just that. But GP doesn't clearly quote that it's a side effect on the object (or how a side effect on an object is defined). Reading an object doesn't mutate it after all.
But whatever language lawyer things, the code is obviously broken, with an obvious fix, so I'm not so interested in what its semantics should be. Here is the fix:
volatile int x;
// ...
int val = x; // volatile read
printf("%x %d\n", val, val);It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
> an unaligned pointer in itself is UB
Yup. Per the "Actually, it was UB even before that" section in the post.
> UB is not on the HW, it has nothing to do with crashes or faults
Yeah. I tried to convey this too, but I'm also addressing the people who say "but it's demonstrably fine", by giving examples. Because it's not.
It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.
Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.
I'd clarify this with "They understand that all values are just bytes".
> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.
It's partly the standards fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".
A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB. It's not that there isn't any progress (Signed twos-complement is coming, after all), it's that there is (I believe) much pushback from compiler authors (who dominate the standards) who don't want to make UB into IB.
I'd agree to a point. I still think it's unreasonable for compiler writers to get all lawyery about precise terminology. After all "implementation defined" could still be subject to the same lawyeriness (we implemented it, ergo we define it).
To me this is an issue of culture. We need to push back against the view that UB means anything can happen, therefore the compiler can do anything.
That said, I think there are many cases where compilers could make a better effort to link UB they're optimizing against to UB that appears in the code as originally authored and emit a diagnostic or even error out. But at least we've got ubsan and friends so it seems like things are within reason if not optimal.
Well I think there is a tension here. C is the language for microcontrollers and the language for high performance.
In ye olden days both groups interests were aligned because speed in C was about working with the machine. Now the UB has been highjacked for speed, that microcontroller that I'm working on, where I know and int will overflow and rely on that is UB so may be optimised out, so I then have to think about what the compiler may do.
I wouldn't say C is the wrong language. I would say there are wrong compilers though.
It's not only C-level is it. There's no (guarantee across architectures for) machine code for that either.
You can, and the results are machine specific, clearly defined and well-documented. Ancient ARM raises an exception, modern ARM and x86 can do it with a performance penalty. It's only the C or C++ layer that is allowed to translate the code into arbitrary garbage, not the CPU.
This is exactly what the compilers do if you use a packed structure to access unaligned data. Works everywhere, as expected. Compilers have always known what to do, they just weren't doing it. C standard says no.
The fact is the standard is garbage and the first thing every C programmer should learn is that they can and should ignore it. There is never any reason to wonder what the standard is supposed to do. The only thing that matters is what compilers actually do.
But then I would assume you are aware of unaligned pointers, and have a sane way to parse that data, rather than read individual parts of it from a raw pointer.
I am curious, what would be a legitimate reason for an unaligned pointer to int?
-Denial: "I know what signed overflow does on my machine."
-Anger: "This compiler is trash! why doesn't it just do what I say!?"
-Bargaining: "I'm submitting this proposal to wg14 to fix C..."
-Depression: "Can you rely on C code for anything?"
-Acceptance: "Just dont write UB."
Unaligned access? Packed structs. Compiler will magically generate the correct code, as if it had always known how to do it right all along! Because it has, in fact, always known how to do it right. It just didn't.
Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so. Alternatively, just disable it straight up: -fno-strict-aliasing. Enjoy reinterpreting memory as you see fit. You might hit some sharp edges here and there but they sure as hell aren't gonna be coming from the compiler.
Overflow? Just make it defined: -fwrapv. Replace +, -, * with __builtin_*_overflow while you're at it, and you even get explicit error checking for free. Nice functional interface. Generates efficient code too.
The "acceptance" stage is really "nobody sane actually cares about the C standard". The standard is garbage, only the compilers matter. And it turns out that compilers have plenty of extremely useful functions that let you side step most if not all of this. People just don't use this because they want to write "portable" "standard" C. The real acceptance is to break out of that mindset.
Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.
Packed structs are dangerous. You can do unaligned accesses through a packed type, but once you take the address of your misaligned int field, then you are back into UB territory. Very annoying in C++ when you try to pass the a misaligned field through what happens to be generic code that takes a const reference, as it will trigger a compiler warning. Unary operator+ is your friend.
Gotta work with the structure directly by taking the address of the packed structure itself.
struct uu64 {
u64 value;
} __attribute__((packed));
struct uu64 unaligned;
struct uu64 *address = &unaligned;
address->value; // this works
u64 *broken = &address->value; // this doesn't
Taking the address of the field inside the structure essentially casts away the alignment information that was explicitly added to stop the compiler from screwing things up. So it should not be done.Mercifully, both gcc and clang emit address-of-packed-member warnings if it's done. So the packed structures are effectively turning silently broken nonsense code into sensible warnings. Major win.
It can be left as implementation defined, which means that the compiler can't simply do arbitrary things, it needs to document what it would do.
Take, for example, signed-integer overflow: currently a compiler can simply refuse to emit the code in one spot while emitting it in another spot in the same compilation unit! Making it IB means that the compiler vendor will be forced to define what happens when a signed-integer overflows, rather than just saying, as they do now, "you cannot do that, and if you do we can ignore it, correct it, replace it or simply travel back in time and corrupt your program".
> Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.
Same here; I built a few non-trivial things that passed the first attempt at tooling (valgrind, UBsan with tests, fuzzing, etc) with no UB issues found.
So we have the next best thing: builtins and flags. So long as those cover all the undefined behavior there is, we can live with it. Compiler gets to be "conformant" and we get to do useful things without the compiler folding the code into itself and inside out.
> -Acceptance: "Just dont write UB."
The point of my article is that this is not possible. This cannot be our end state, as long as humans are the ones writing the code. No human can avoid writing UB in C/C++.
Or you just not skip the introductory pages, that tell you what the language philosophy of C is, and why there is UB. Yes, UB can be a struggle, but the first four steps are entirely unnecessary. It means that you do not actually understand the core concepts of the very same language you are using, which is kinda stupid.
When that started happened people became alarmed (oMG UB iS TeH BAD!) and since some old UB machines still had industry support (of organisations that actually participated in ISO meetings instead of arguing online) there was never any movement on defining de-facto usage as de-jure and the alarmist position became the default.
Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallell "de-jure" standard that would've had a chance to be adopted by compiler creators.
I guess I am too young, and also too much a purist, because I start from the impression of what the language is, not what the implementations happen to do.
> Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallell "de-jure" standard that would've had a chance to be adopted by compiler creators.
-O0
Just switch to a saner language.
And before I get attacked for being a Rust shill, I meant Java :P
The bar is so low it's floating near the center of the Earth.
If all you want is C but less insane then the obvious answer here is Zig.
If all somebody want is a turn key replacement to C/C++ ecosystem, then there is nothing like that in the world that I’m aware of.
And where's the fun in that?
Unedefined behaviour means "we couldn’t settle on a best default trade-off with fine-tuning as a given option so we let everyone in the unknown".
Because the last time I looked it appeared to need some godawful slow bytecode interpreter that took up thousands of kilobytes of RAM.
https://www.graalvm.org/latest/reference-manual/native-image...
In the past you could use e.g. Excelsior JET.
Meanwhile half the world appears to run on cpython of all things.
The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.
Most C++ today will be immediately obvious and not accidentally mixed up with C.
http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html
"STORAGE_ERROR This exception is raised in any of the following situations: (...) or during the execution of a subprogram call, if storage is not sufficient."
It's not safe though because throwing an exception, panicking, etc, is still a denial of service. It's just more deterministic than silently overwriting the heap instead. If the program is critical then you need to be able to statically prove the full size of the stack, which you can do with C and C++ with the right tools and restrictions.
If a segfault leads to some other state you do not deem "safe", such as a single program gating access to a valuable asset with a default fail state of "allow", you just have a fundamental design flaw in your system. The safety problem is you or your AI agent, not the segfault.
First, you can define what happens when stack space is exceeded. Second not all programs need an arbitrary amount of stack space, some only need a constant amount that can be calculated ahead of time. (And some languages don't use a stack at all in their implementations.)
Your language could also offer tools to probe how much stack space you have left, and make guarantees based on that. Or they could let you install some handlers for what to do when you run out of stack space.
You can even kind of retrofit this to C. The classic example is "sel4". You just need a set of proofs that the code doesn't trigger UB. This ends up being much larger and more complicated than the C itself.
How to think of this properly is that when you have UB, you are no longer under the auspices of a language standard. Things may work fine for a time, indefinitely even. But what happens instead is you unknowingly become subject to whimsies of your toolchain (swap/upgrade compilers), architecture, or runtime (libc version differences).
You end up building a foundation on quicksand. That's the danger of UB.
Tbh, already the first example (unaligned pointer access) is bogus and the C standard should be fixed (in the end the list of UB in the C standard is entirely "made up" and should be adapted to modern hardware, a lot of UB was important 30 years ago to allow optimizations on ancient CPUs, but a lot of those hardware restrictions are long gone).
In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not. On most modern CPUs unaligned load/stores are no problem at all (not even a performance penalty unless you straddle a cache line). There's no point in restricting the entire C standard because of the behaviour of a few esoteric CPUs that are stuck in the past.
PS: we also need to stop with the "what if there is a CPU that..." discussions. The C standard should follow the current hardware, and not care about 40 year old CPUs or theoretical future CPU architectures. If esoteric CPUs need to be supported, compilers can do that with non-standard extensions.
This is in contrast to the other category that exists, which is "implementation-defined".
This is plain wrong. Undefined behaviour, means the C standard specifies no restriction on the behaviour of the program, which is what the implementation chooses to emit. An implementation can very well choose to emit any program it pleases, including programs that encrypt your harddisk, but also programs that stick to well defined rules.
Writing C (or any language) means adhering to the standard, because that's the definition of the language.
For most C software on x86_64, UB is "fine" with very strong bunny ears. But it is preferable for one to, shall we say, write UB intentionally rather than accidentally and unknowingly. Having an awareness of all the minefields lends for more respect for the dangers of C code, it makes one question literally everything, and that would hopefully result in more correct code, more often.
On that note, on some RISC-V cores unaligned access can turn a single load into hundreds of instructions.
I think the problem is just that C is under specified for what we expect a language to provide in the modern age. It is still a great language, but the edges are sharp.
However I do agree that just saying "the behaviour is undefined" is an unhelpful cop-out. They could easily say something like "non-atomic misaligned accesses either succeed or trap" or something like that.
> In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not.
Not just the CPU - memory decides as well. MMIO devices often don't support misaligned accesses.
That means that the compiler must emit the read, even if the value is already known or never used, as it might trap. There is a reason for the UB!
https://lists.isocpp.org/std-proposals/2020/05/1322.php
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p28...
Low level platform-specific code that needs to hot spin until an interrupt happens can use assembly for that part which it will need to do for the interrupt handler anyway.
In worse scenarios, your programme will silently continue with garbage, or format your hard disk or give attackers the key to the kingdom.
Another commenter suggested using LLMs, but I disagree. Having clangd emit warning squiggles for unchecked operations (like signed addition) would be a good start.
Dead code elimination is essential for performance, especially when using templates (this is basically what enables the fabled "zero cost abstraction" because complex template code may generate a lot of 'inactive' code which needs to be removed by the optimizer).
The actual issue is that the compiler is free to eliminate code paths after UB, but that's also not trivial to fix (and some optimizations are actually enabled by manually injecting UB (like `__builtin_unreachable()` which can make a measurable difference in the right places).
Not, that the compiler can also emit code paths before UB, as UB is a property of the whole program, not just of a single statement.
enum op_t{ add, mul };
int exec(op_t op, int a, int b) {
if(op == add) { return a+b; }
if(op == mul) { return a\*b; }
}
c = exec(add, a,b);
Should be the compiler be prevented from inlining exec and constant-propagating op and removing the mul branch? What about if a and b are constants and the addition itself is optimized away? There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy
You've probably been churning out possibly malformed code for years. Now you're becoming aware of your shortcomings. This is usually considered the transition from intermediate- to senior-level programmer.The real answer is that proponents of languages like C seem to completely disregard the dangers/difficulty of hitting/difficulty of fixing UB. Proponents of languages like Rust overstate it instead. Pointless wars/drama is fun to read and gets clicks.
There were a few high-profile "scandals" around GCC 3.2 (IIRC) because the compiler finally started much more aggressively using UB in optimizations, which was a reason that lots of people stayed on GCC 2.95 for a very long time. GCC 3.2 came out in 2002.
It's reasonable that such people would also be interested in design aspects of languages, and UB in C is in that field. Though I would argue that a lot of it was originally accommodating old CPU architectures without compromising performance too badly, and about as much a "design choice" as wheels being round...
I trust your historical C usage was more productive than that..
I have lots of my code running day-in, day-out on literally hundreds of millions of machines. The approach to "getting it working" is exactly OP's.
I'll admit to being pretty defensive and anal in checking values and return-codes (more so than most, I suspect), and I'm a firm believer in KISS principles in software engineering ("solving hard problems with complicated code is easy, solving them with simple, understandable algorithms is the hard bit") but generally there's no real difference in approach to the code I write to work on my workstation, and the code I write to work in the field.
Every company keep harping on about safety and being exposed (being in the news): so the narrative against 'unsafe' is up the wazoo.
The new world is basically a bunch of city dwellers who haven't seen raw nature and you show them a lawn mower, they freak out. Blades that spin?!?!?! Madness!!
Can't talk about C without CVE.
If you think C is the problem, you'll come to the eventual conclusion that humans are the problems, and greed. Don't hate the player, hate the game etc.
C was invented so you don't have to write assembly. It wasn't invented to expose devices to billions of other devices.
Although I haven’t noticed a spike the last 6 months, just a slowly increasing realization that C isn’t fit for humans and should go the way of asbest: Don’t use it for anything new, and remove it where it already exists, unless doing so would be too expensive or disruptive.
Personally I like C because you should have a good idea of what it's going to do. Other languages feel like a black box, and I start having to fight them far too often. But I say that as a hacker of low level stuff, not as someone who's paid and working on higher level stuff, so that is probably a niche view.
2. You don't really appreciate the issue. Signed integer overflow is undefined. If you check for that overflow after the fact the compiler can, and demonstrably has pretended that the overflow can't happen and optimised away your overflow check.
You may not even come across that failure mode to know to 'fix' it. And good luck finding the issue unless you know about UB and what the compiler can and will do in such situations.
After switching to Rust five years ago I agree with all the Rust hipsters as far as disliking those languages go.
I just don't talk about it a lot. If every Rust person I know that was a C/C++ developer before was as outspoken about what they think of the latter, you'd see that these people are a majority.
We're just old hands who like to use stuff that works. And most of us don't get attached to code or languages.
It's also difficult to yourself that you were never in command of a language as far as UB/other footguns go, as much as you thought. Or ever, for your enire career. For me that self-realization about C/C++ (enabled by Rust) was a turning point.
Lately you can read about the dichotomy re. AI use.
I.e. developers who define them themselve through what they build/ideas embracing LLMs for what they can do.
I.e.: I am what I build.
Whereas developers for whom software engineering is a craft that defines them hate them openly.
I.e.: I am how I build.
Now this seems to suggest to me that maybe Rust developers who openly hate C/C++ squarely belong to the latter group whereas the silent ones belong to the former. It's builders vs programmers. Just different world views.
Also you can not dislike something and still not speak about it because you decided to not care.
That word is carrying a lot of weight here. Compilers are unbelievably complex these days, and it's impossible for any one human to fully understand the entire compilation process, including the effects of any arbitrary combination of compiler flags.
Any assumptions you have about what the compiler does in the face of UB will collapse on the next patch release of that compiler, or the moment somebody changes the compiler flags, or the moment somebody tries to compile the code for a slightly different OS, not to mention architecture.
There is no other way to understand what C compilers do than reading the standard.
Linux works on a wide variety of platforms. It also relies on those platforms behaving predictably with respect to what the standard leaves undefined.
This description of ISO UB as a totally insane wonderland of random, malevolent semantics just doesn't describe reality.
- Creating a potentially troublesome misaligned int pointer is a precisely localized and completely explicit user mistake, not something that just happens because it's C.
- Passing signed char to character classification functions that expect an unsigned char (disguised as an int) is a very specific dumb user error. The C standard could specify that all negative inputs, including EOF and invalid signed char values, are classified as not belonging to the character class, but I doubt the current undefined behaviour in isxdigit() etc. implementations ever went beyond accepting invalid inputs.
- Casting floating point values to integer values in general requires taking care of whether the FP values are small enough to be represented and what to do with NaN and Inf values: not the language's responsibility. C offers a toolbox of tests, not ready-made application specific error handling.
- Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is. Where stuff in an executable is loaded in memory, in the rare cases when it matters, can surely be affected with platform specific extensions, possibly at the level of linker commands with nothing appearing in the C source code.So I see your counter points are all "so just don't do that, then".
And the point of my post is that this particular "just don't do that, then" has never been achieved by humans.
If if there's no example of a program without these bugs in a language, then I do think it's fair to blame the language. A knife with 16 blades and no handle.
> Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is.
Like the post says, it's rare that programmers actually want a pointer to memory address zero. But in my experience most programmers who even encounter that have this "complete lack of understanding", as you put it.
For example, you seem to underestimate how wrong placing negative values in a signed char is: ordinary character encodings do not use negative codes, so either those negative values are not characters and they have no business being treated as such, or something strange and experimental is going on.
As I stated:
> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C/C++ code has UB.
It's about that point, not about how to avoid it. Because you can't.
Good opens source ones:
Frama-C
IKOS (from NASA)
The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
LLM generated code will eventually contain UB.
EDIT: added "eventually"
And it also signals that you actually do want to improve, just a little bit of boy scout rule goes a long way.
(only slightly tongue-in-cheek, I do believe that removing silly things is worthwhile).
> The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
Yup. But the point of the article is that even expert humans cannot do this alone. And as I wrote, LLM+junior won't suffice either. We need LLM+senior experts.
And it's a problem that we have way more existing UB than expert capacity.
Now, will LLMs and experts both miss UB in some cases? Of course. There's no 100% solution. But LLMs, I claim, will find orders of magnitude more, with low false positive, than any expert. Even if these expert humans (like in the OpenBSD case for the two bugs I found, one of which was UB) are given more than three decades to do it.
I didn't even use the best model, complex code target, or time. I just wanted to choose a target that has a high chance of having very good experts already having audited it.
Yes.
Even in languages other than C (i.e. you will get behaviour that nothing in the input specified).
When LLMs generate code, all languages have UB.
UB means literally no restrictions. So if you standard says 'you have to crash with an error message' that's already no longer UB.
Sure. For crashes. But when you instruct an LLM to do something, the output is probablistic, so you may get behviour that is unexpected and/or unwanted.
Like storing security tokens in code. Or nuking the production database.
The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.
> The part about hardware is wrong BTW
Could you be more specific? I think by "wrong" you may mean "not actually relevant to UB", and you're right about that. If that's what you mean then that part is not for you. It's for the "but it's demonstrably fine" crowd.
> the hardware semantics are clearly defined
Yup. The article means to dive from the C abstract machine to illustrate how your defined intentions (in your head), written as UB C, get translated into defined hardware behavior that you did not intend.
I'm not saying the CPU has UB, and I wonder what part made you think I did.
That's what I mean game of telephone. The UB parts get interpreted as real instructions by the hardware, and it will definitely do those things. But what are those things? It's not the things you intended, and any "common sense" reading of the C code is irrelevant, because the C representation of your intentions were UB.
I mean, you have to go out of your way and use a cast to get the UB in the first example.
For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.
For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.
> For all you know the compiler has no internal way to even express your intention here.
I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?
> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.
I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.
I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
Probably confusion with C++ where NULL is 0 which is a special case that can be implicitly cast to both integers and pointers, unlike non-zero constants. C doesn't need this because it doesn't require explicit casts from void pointers to others.
https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...
When comparing signed and unsigned integers of same size the signed one will be converted to unsigned. In a reasonably configured project compiler will warn about it.
In case of integers smaller than int, promotion to int happens first.
In case of signed and unsigned integers of different size, the smaller one will be converted to bigger one.
i = i++edit: I'm not sure it's even undefined in C.
Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.
People complain about C as though they know how to fix it.
Brainfuck is "simple" by any other definition as well, but that's not a useful quality.
That doesn't mean the C is a safer language than Swift, or a less-capable language than Swift. But in terms of "easy to understand along the happy-path", it's a lot easier to get going in C.
Swift, for example, bakes a whole load of CS-degree-level ideas and concepts into the basic language with its optionals, unwrapping, type-inference, async/await, existential types, ... ... ... . C doesn't do any of that. There are (many!) more footguns in C, but the language is less complex as a result.
Brainfuck is not at all simple, from that point of view. This is a valid Brainfuck program:
++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.
This is a valid C program
#include <stdio.h> int main() { printf("Hello\n"); }
One of these is far simpler than the other.
The brainfuck example is "simpler": Only 8 kinds of tokens! Not really useful, though.
The cognitive load of _actually delivering software_ written in C is immensely greater than doing so with Swift, or Rust, or Python, or Java, even Zig, despite all of those leveraging much heavier machinery in order to deliver a friendlier abstract model for you to program against.
The tragedy of C is that, in addition only delivering very baseline abstraction tools, it also adds its own set of seemingly arbitrary rules and requirements that come from nowhere but the C standard. Fictitious limitations to suit a bygone era. The abstract model of C is fine in some places, but definitely not fine in other places, and my hypothesis is that most UB in practice comes from a mismatch between programmer intuitions and C's idiosyncracies.
I'm not an expert in either language but my anecdotal experience disagrees with this - writing Zig has been far simpler and less error-prone than writing C.
Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.
Instead it's been highjacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."
Int c = abs(a) + abs(b); If (a > c) //overflow
Is UB because some system might do overflow differently. In practice every system wraps around.
That should be a valid check, instead it gets optimised away because it 'can't' happen.
C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
Additionally, some (most?) UB is intentionally UB so that optimisers are free to do fancy tricks assuming that certain cases will never happen. Indeed, this is required for high performance. If they do happen, again, it can lead to unexpected behaviour.
PS: Most languages that don't have a specification declare their primary implementation to be specification-as-code. Rust is an example of that, and it does still have UB: the cases that the compiler assumes will not happen.
edit: for example I'm typing this into Safari which means probably every key press and event is going through JSC JIT compiled functions—which have, structurally and necessarily and intentionally, COMPLETELY undefined behavior according to the spec—and yet it miraculously works, perfectly, because the spec doesn't really matter
3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.
I touched on this in the "it's not about optimizations" section. It's not the compiler is out to get you. It's that you told it to do something it cannot express.
It's like if you slipped in a word in French, and not being programmed for French, it misheard the word as a false friend in English. The compiler had no way to represent the French word in it's parse tree.
So no, it's not overly legalistic. Like if the compiler knows that this hardware can do unaligned memory access, but not atomic unaligned access, should it check for alignment in std::atomic<int> ptr but not in int ptr? Probably not, right?
Aren't "unpredictable results" and "no requirements" contrary to the idea that the behavior would be "somewhat bounded"?
I don't think you could sincerely argue that this definition intends to allow the compiler to totally rewrite your code because of one guaranteed UB detected on line 5, just that it would be good to print a diagnostic if it can be detected, and if not to do what's "characteristic of the environment". Does that make sense?
Bounding UB would be a nice idea, or at least prohibiting time-traveling UB (and there is an effort in that direction). But properly specifing it is actually hard.
Only things you need to worry about then are things with actual observable side-effects - printf, volatile, etc - and C23 does specify that all observable behavior must happen even if UB follows, and compilers can't generally optimize function calls anyway (e.g. on systems on which you can define custom printf callbacks, you could put an exit(0) in such, and thus make it incorrect to optimize out a printf ever).
Reading adversarially is what people do who are looking for ways that something can be abused, from an offensive or defensive position.
Personally I am tired of the entire topic.
I've (fruitlessly) had this discussion on HN before - super-aggressive optimisations for diminishing rewards are the norm in modern compilers.
In old C compilers, dereferencing NULL was reliable - the code that dereferenced NULL will always be emitted. Now, dereferencing NULL is not reliable, because the compiler may remove that and the program may fail in ways not anticipated (i.e, no access is attempted to memory location 0).
The compiler authors are on the standard, and they tend to push for more cases of UB being added rather than removing what UB there is right now (for exampel, by replacing with Implementation Defined Behaviour).
(I hope casting fear is not UB)
I'm sure that's UB in C
In C++ just use <reinterpret_cast>
I say this as an experienced C developer.
Shrug.
Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.
No HN comment can possibly cover all the use cases of C++ but in general, unless you have a very good reason not to:
- eschewing boomer loops in favor of ranges
- using RAII with smart pointers
- move semantics
- using STL containers instead of raw arrays
- borrowing using spans and string views
These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)
>After all, C/C++ is not a memory safe language.
Will fix it.
In the context of UB discussion, the arguments apply equally to C and C++.
How would you write that?
I entirely agree with all your points that C and C++ are completely different languages at this point. And yet I wanted to write this post about something that is true for both.
However, that's obviously not the point? Ignoring the idea that people can/should just "git gud" and write perfect code in a language with lots of old traps, you can't control how everyone else writes their code, even on your own team once it gets big enough. And there will always be junior devs stumbling into the bear traps of c/c++ (even if the rest of the codebase is all modern c++). So no matter how many great new features get added to C++, until (never) they start taking away the bad ones, the danger inherent to writing in that language doesn't go away.
Also, safe != non-UB. TFA isn't so much about memory safety anyway.
In the end, everything comes down to culture war.
As far as stdlib usage is concerned: that's just your opinion. The stdlib has a lot of footguns and terrible design decisions too, e.g. std::vector pulling in 20k lines of code into each compilation unit is simply bizarre.
Also:
- eschewing boomer loops in favor of ranges
Those "boomer loops" compile infinitely faster than the new ranges stuff (and they are arguably more readable too): https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/
- borrowing using spans and string views
Those are just as unsafe as raw pointers. It's not really "borrowing" when the referenced data can disappear while the "borrow" is active.
> bool parse_packet(const uint8_t* bytes) {
> const int* magic_intp = (const int*)bytes; // UB!
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.you might try to argue that uint8_t is not necessarily char, and while it is true that implementations of C can exist where CHAR_BIT > 8, but those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as i can tell. Of course CHAR_BIT is required to be >= 8, so if it is not >8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is)
Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.
A "world-class expert in low level programming" knows that unaligned memory accesses are no problem anymore on most modern CPUs, and that this particular UB in the C standard is bogus and needs to fixed ;)
Pointer casts changing pointer bit sequences is common on weird platforms (eg: some TI DSPs, PIC, and aarch64+PAC). And it is valid as per spec. Pointer assignment is not required to be the same as memcpy-ing the pointer unto a pointer to another type.
You misunderstood the spec. No promises are made that that cast copies the pointer bit for bit (and thus creates an invalid pointer). Therefore, your objection to invalid pointers is null and void. :)
6.3.2.3 paragraph 7: A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned[footnote 68]) for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
This is a subsection of section 6.3 which describes conversions, which include both implicit and conversions from a cast operation. This language is not saying anything about bit representations or derefencing.
I happen to be wearing my undefined behavior shirt at the moment, which lends me an extra layer of authority. I'm at RustWeek in Utrecht, and it's one of my favorite shirts to wear at Rust conferences. But let's say for the sake of argument that you are right and I am indeed misunderstanding the spec. Then the logical conclusion is that it's very difficult for even experienced programmers to agree on basic interpretations of what is and what isn't UB in C.
In practice performing a cast doesn't really do much until you dereference, but without a carve out in the spec, it does really mean "the behavior is undefined".
A "valid" pointer to the wrong object?
Doesn't this part exclude the possibility of rounding down?
> A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned71) for the referenced type, the behavior is undefined.
C23 6.3.2.3p7.
Great way to demonstrate the point of the article.
On recent Intel and 64-bit ARM processors, data alignment does not make processing a lot faster. It is a micro-optimization. Data alignment for speed is a myth. // https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...
(while in the olden days, a program may crash on unaligned access, esp on RISC)
I don’t see what spec part would prohibit that cast from validly compiling to
BIC r3, r0, #3
Spec only guaranteed round-trip through char* of properly aligned for type pointers. This doesn’t break that.(e.g. just compiling with address sanitizer and using static analyzers catch pretty much all of the 'trivial' memory corruption issues).