The problem is that it's currently legal to pass a string literal to a function expecting a (non-const) pointer-to-char argument. As long as the function doesn't try to write through the pointer, there's no undefined behavior. (If the function does try to write through the pointer, the behavior is undefined, but no compile-time diagnostic is required.) If a future version of C made string literals const, such a program would become invalid (a constraint violation requiring a diagnostic). Such code was common in pre-ANSI C, before const was introduced to the language.
The following is currently valid C. The corresponding C++ code would be invalid. The proposal would make it invalid in C, with the cost of breaking some existing code, and the advantage of catching certain errors at compile time.
#include <stdio.h>

void print_message(char *message) {
    puts(message);
    // *message = '\0'; // would have undefined behavior
}

int main(void) {
    print_message("hello");
}
Of course it is. Modifying string literals doesn't work on anything modern, so it's impossible for portable code that actually runs in the real world and has to work to have relied on it for a long time.
Your example is not code any competent C programmer would ever write, IMHO. Every proficient C programmer I've ever worked with used "const char *" for string literals, and called out anybody who didn't in review.
Old code already needs special flags to build with modern compilers: I think the benefit of doing this outweighs the cost of editing some makefiles.
Apart from that, it's not about actually modifying string literals. It's about currently valid (but admittedly sloppy) code that uses a non-const pointer to point to a string literal. It's easy to write such code in a way that a modern conforming C compiler will not warn about.
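For instance, something along these lines (a sketch, with made-up names) is accepted by a conforming compiler today with no diagnostic unless you opt into something like -Wwrite-strings:

    /* Valid C today, no diagnostic required.  If string literals became
       const, the first declaration would be a constraint violation. */
    char *greeting = "hello";   /* non-const pointer into a string literal */

    void shout(char *s);        /* prototype without const */

    void caller(void) {
        shout(greeting);        /* nothing stops shout() from writing through s */
    }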
That kind of code is the reason that this proposed change is not just an obvious no-brainer, and the author is doing research to find out how much of an issue it really is.
As it happens, I think that the next C standard should make string literals const. Any code that depends on the current behavior can still be compiled with C23 or earlier compilers, or with a non-conforming option, or by ignoring non-fatal warnings. And of course any such code can be fixed, but that's not necessarily trivial; making the source code changes can be a very small part of the process.
Any change that can break existing valid code should be approached with caution to determine whether it's worth the cost. And if the answer is yes, that's great.
I don't understand your point here: I disagree this is "obvious", and I don't think I've said anything to imply that?
> And of course any such code can be fixed, but that's not necessarily trivial; making the source code changes can be a very small part of the process
In many cases, it's so trivial that you can write code to patch the code. Often, the resulting stripped binary will be identical, so you can prove it's not even necessary to test the result! If decision makers can be made to understand that, you can route around most of the corporate process that makes this sort of thing hard.
I've spent a lot of time fixing horrible old proprietary code to use const because I think it's important: most of the time, it's very easy. I don't deny there are rat's nests that require a lot of refactoring to unwind, but in my personal experience that's the exception rather than the rule.
It will be vanishingly rare that code will need to be modified in a way that actually changes its runtime behavior to tolerate the proposed change.
My point is also that that's a valid reason to proceed carefully before making the change.
Even if the required source code changes are trivial or automatable, there will still be some variable amount of work required to deploy the changes. For a small program or library, maybe you can just rebuild and deploy. But for some projects, any change requires going through a full round of review, testing, recertification, and so on. For an update to code that controls a medical device or a nuclear reactor, for example, changing the code is the easy part.
I support the proposed change. I also support performing all due diligence before imposing it on all future implementations and C software.
If the new binary is literally identical to the last one that passed validation, absolutely zero additional testing is required. It is a waste of resources to retest an identical binary (assuming everything else can be held constant, of course, which obviously can't always be the case).
Actually sending our hypothetical refactoring to production would itself be a waste of resources anyway, since the binary is identical... you just skip it, wait for the next real change, and then proceed as usual.
All processes have exceptions, the "binary identical output" is an easy one if your leadership chain is capable of understanding it.
And to be clear, "binary" here could absolutely mean "entire firmware image". The era of reproducible builds is upon us, and it is glorious.
But ...
"The era of reproducible builds is upon us"
What about old code built with old toolchains? And what about organizational policies that require a full round of testing for any update? How hard do you think it would be to change such policies?
No doubt there's some software that could easily be modified, recompiled, and released. My point is that there is also some software that can't.
And yes, in those cases the likely solution is to leave the code alone and continue to build it with the old toolchain.
The point is that the proposed change will break existing valid code, and that has a non-zero cost. I support Jens Gustedt's effort to find out just what that cost is before imposing the change. (And again, I hope the change does go into the next edition of the standard.)
But maybe 70 warnings in 250k LoC is OK for your standards of proficiency.
70 warnings really doesn't sound that bad to fix. Most are probably trivial. I'm sure a few aren't.
If nobody is around to fix it, that's what legacy flags are for.
Moreover, that's an old-style cast. GNU C++ has an opt-in warning for those, -Wold-style-cast; you then need const_cast to get around it.
Then we can grep the program for that new-style cast (unless it token-pasted the const_cast together in the preprocessor, haha).
In C we can write such a program with no devices that defeat the type system at all, and it still requires no diagnostic.
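Something like this sketch: no cast anywhere, no constraint violated, and a conforming compiler is not required to say a word.

    #include <stdio.h>

    void clobber(char *s) {
        s[0] = 'H';       /* undefined behavior when s points at a literal */
        puts(s);
    }

    int main(void) {
        clobber("hello"); /* passes a literal through a plain char *, no cast */
        return 0;
    }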
For example, clang started simply omitting writes to data it knows to be read-only (which is allowed because these writes are undefined behavior, so anything goes). See this example[1]: `writable()` will return "*ello", but `readonly()` will just return "hello" and not crash (note its assembly doesn't include a write).
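Roughly, the shape of it is something like this (a sketch of the effect, not the actual linked code):

    const char *writable(void) {
        static char s[] = "hello";   /* writable array copy of the literal */
        s[0] = '*';                  /* store is kept */
        return s;                    /* "*ello" */
    }

    const char *readonly(void) {
        char *s = "hello";           /* points into read-only .rodata */
        s[0] = '*';                  /* UB: clang may simply drop this store */
        return s;                    /* still "hello", and no crash */
    }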
Although, I am curious if that optimization could happen across compilation units via LTO...
If you change `writable()` to receive a `const char *` (and then cast it to `char *` to write), then clang will be forced to compile it with a store (even though it sees you storing to a `const char *`) because it doesn't know if the function will be called with a pointer to actual read-only data or just a pointer to writable data that was gratuitously converted to `const`.
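In other words, something like this (again just a sketch, names made up): with an external function taking const char *, clang has to keep the store, because the pointee might be perfectly writable memory.

    /* Sketch: clang cannot assume a const char * points at read-only data. */
    void scribble(const char *s) {
        char *p = (char *)s;   /* casting away const */
        p[0] = '*';            /* store stays: fine if the caller passed a
                                  writable buffer, UB only if it was a literal */
    }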
That's exactly my point, yeah: the optimization you described is only possible because you gave the compiler extra knowledge about the argument to that function (because it was static in the same compilation unit). It's artificial; typically that won't be the case.
I remember there was a lot of confusion when llvm started removing stores to read-only memory[1]; some people got angry because it broke some kernel code (which only worked because, being in the kernel, the memory page wasn't actually marked read-only) and thought it would break any code that casts away a `const`, which is very common and valid as long as it was gratuitously `const`, as you say.
[1] https://releases.llvm.org/9.0.0/docs/ReleaseNotes.html#notew...
I'm not denying that there are codebases where trying this would result in an Armageddon of refactoring, but I would venture that's the exception rather than the rule.
Most C programmers use "const char*" for string literals, and have for a long time.
C++ went through this over 20 years ago. I can't remember if it was already in C++03 or whether it was a post-'03 draft feature.
BLAS, gemv, GEMM, SGEMM libraries are from 1979, 1984, 1989. You may have seen these words scroll by when compiling modern 2025 CUDA :)
C has no backwards compatibility guarantee, and it never has. Try compiling K&R C with gcc's defaults, and see what happens.
You can build your legacy code with legacy compiler flags. Why do you care about the ability to build under the modern standards?
On AVR or other MPU-less architectures you can literally modify string literal memory without triggering a crash.
Why? Because there is no memory protection at all: nothing enforces a read-only ".rodata" section.
And such microcontrollers are still in use today, so it's a bit too far-fetched to say "really old code."
It's UB, sure, but how many embedded programmers actually care? The OP's proposal is trying to change the type system so that this UB becomes much less likely to trigger in practice.
Quote from the gcc manual, explaining why you need to compile old code with the -fwritable-strings option: "you cannot call mktemp with a string constant argument. The function mktemp always alters the string its argument points to.
Another consequence is that sscanf does not work on some systems when passed a string constant as its format control string or input. This is because sscanf incorrectly tries to write into the string constant. Likewise fscanf and scanf."
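In other words, the pre-ANSI idiom the manual is warning about looked something like this (a sketch):

    #include <stdlib.h>

    char *make_temp_name(void) {
        /* mktemp() writes into its argument, so it needs a writable array: */
        static char name[] = "/tmp/exampleXXXXXX";
        return mktemp(name);
        /* The broken old idiom: return mktemp("/tmp/exampleXXXXXX"); */
    }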
The standard C library uses const char * almost everywhere where a string is accepted that will not be modified.
So you might have a function that doesn't have proper "const" qualifications in its prototype, like:

    void my_log(char *message);

and then call sites like:

    my_log("Hello, World!");

...and that needed to stay compiling.
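When my_log really never writes to its argument, the usual fix is a one-liner, and the literal call sites don't change (a sketch):

    void my_log(const char *message);   /* call sites like my_log("Hello, World!")
                                           keep compiling, const literals or not */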