Linux Eliminates the Strncpy API After Six Years of Work, 360 Patches (www.phoronix.com)

72 points by simonpure 3 hours ago | 5 comments

senfiaj33 minutes ago
I wonder, why not use a string buffer paired with its length? For example, maybe use a struct that has a char pointer and 2 ints (occupied length + total buffer length), almost like C++'s std::string. This null terminator thing really sucks: it's potentially insecure and often unperformant.
- none_to_remain a minute ago
  The size overhead of that is 2*sizeof(int) while the overhead of null termination is sizeof(char). If I remember the standard right, the former is worse by at least sizeof(char), and usually more in practice. This used to matter, sometimes still does.
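  A minimal sketch of that arithmetic, assuming the pointer-plus-two-ints layout suggested in the parent comment. The names `counted_str`, `counted_overhead` and `nul_overhead` are made up for illustration, and the actual struct size depends on the platform's padding rules.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Hypothetical layout from the parent comment: pointer plus two ints. */
  struct counted_str {
      char *data; /* not NUL-terminated; len bytes are in use */
      int len;    /* occupied length in bytes */
      int cap;    /* total buffer size in bytes */
  };

  /* Bookkeeping bytes per string, beyond the pointer both schemes need. */
  static size_t counted_overhead(void) { return 2 * sizeof(int); }
  static size_t nul_overhead(void) { return sizeof(char); }

  int main(void) {
      /* sizeof(char) is 1 by definition and sizeof(int) >= 1, so the
         difference is always at least sizeof(char). */
      assert(counted_overhead() - nul_overhead() >= sizeof(char));
      printf("counted: %zu bytes, NUL-terminated: %zu byte, struct: %zu bytes\n",
             counted_overhead(), nul_overhead(), sizeof(struct counted_str));
      return 0;
  }
  ```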
- chiph6 minutes ago
  Pascal did/does this, but eventually someone wants a string longer than the size portion can handle. Or wants the number of characters not the number of bytes.
- GalaxyNova13 minutes ago
  Yes I have seen it happen a few times with `strlen` being called in a loop silently causing O(N) to turn to O(N^2)
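  A rough sketch of how that happens. `counting_strlen` is a hypothetical wrapper that only exists to make the hidden work visible, and the two `count_spaces_*` functions are illustrative names.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <string.h>

  static size_t bytes_scanned; /* total bytes examined by counting_strlen */

  /* strlen wrapper that records how much work each call does. */
  static size_t counting_strlen(const char *s) {
      size_t n = strlen(s);
      bytes_scanned += n + 1; /* n characters plus the terminator */
      return n;
  }

  /* Accidentally quadratic: the length is recomputed on every iteration. */
  static size_t count_spaces_slow(const char *s) {
      size_t spaces = 0;
      for (size_t i = 0; i < counting_strlen(s); i++)
          if (s[i] == ' ') spaces++;
      return spaces;
  }

  /* Linear: the length is computed once and reused. */
  static size_t count_spaces_fast(const char *s) {
      size_t spaces = 0;
      size_t n = counting_strlen(s);
      for (size_t i = 0; i < n; i++)
          if (s[i] == ' ') spaces++;
      return spaces;
  }

  int main(void) {
      const char *text = "a b c d"; /* 7 characters, 3 spaces */

      bytes_scanned = 0;
      size_t slow = count_spaces_slow(text);
      size_t slow_work = bytes_scanned;

      bytes_scanned = 0;
      size_t fast = count_spaces_fast(text);
      size_t fast_work = bytes_scanned;

      assert(slow == 3 && fast == 3);
      assert(slow_work == 64); /* 8 loop checks, 8 bytes scanned each */
      assert(fast_work == 8);  /* one scan */
      printf("slow scanned %zu bytes, fast scanned %zu\n", slow_work, fast_work);
      return 0;
  }
  ```

  Both versions return the same answer; only the amount of scanning differs, which is why this tends to go unnoticed until the input gets large.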
mrlonglong3 hours ago
The zero-terminated string is, I think, computing's biggest mistake. Pascal-style strings were much safer.
- dietr1ch2 hours ago
  I think it was NULL itself. It was a long way until we realised we don't want invalid values and could use the type system to help us use special values safely.
  - jkrejcha10 minutes ago
    The problem here is that null kinda is consequential of intentional design of the type system itself. Remember, C is a kinda "portable assembler" so the constructs in it are based relatively closely to how low level data structures are mapped out in memory.
    This is, and continues to be, an incredibly useful feature that makes C and C structs immensely useful concepts. Part of that does need an invalid value[1]. NULL is convenient for this and although there are some very weird JavaScript-trinity-meme-style consequences for this[2], it's such a useful concept that basically all languages that have the ability to construct pointers have a null pointer[3].
    The alternative world looks like everyone inventing their own invalid values. Invalid, non-null pointers are typically MUCH worse than null pointers for debuggability and security. If you unintentionally read/write/execute memory at 0x0 (by far the most common value for NULL), most operating systems will trap this, whereas they may not if 0x12345678 is your invalid value.
    [1]: Stuff like IA64 had NaT bits which were effectively an extra bit for what I assume to be this sorta thing. The problem with this is that it costs an extra bit. I don't really know much about IA64, but presumably [NaT 1] + [don't care] would be your null pointers here. I think?
    [2]: Really what the standard, in my opinion, should have done is probably not make use of the null pointer UB for many different functions. A lot of compilers took the UB surrounding that to make incredibly dubious "optimizations" that broke stuff with zero actual performance benefit whatsoever
    [3]: Yes, even Rust. Although some (again in my opinion) unfortunate design decisions made it so that C-Rust FFI isn't zero cost because of how it treats spans/slices
  - bellowsgulch an hour ago
    Compared to scripting languages with actual tagged types, C doesn't really have a type system, and that's readily apparent to anyone who has written C in the last 43 years and debugged a program written in it.
    C pretends types exist with you, but once bytes hit the road, it's all real-life and segmentation faults.
  - jkercher2 hours ago
    Meh, I think NULL is fine in C. It's an extra, valid state to represent pointers at no cost. Unlike the more hand holdy languages, it's quite rare for a pointer in C to have the ability to be NULL since, more often than not, it's pointing at something known. It's actually quite rare to see NULL checks unless it's API code or something like that. I can see this being more of a problem in a managed language where anything can be NULL at any time.
    bvrmn25 minutes ago
    NULL as a concept is fine. Inability to declare something as non-null is not.
    There is a huge gap between the developer expectation that "it's pointing at something known" and the hard reality confirmed by zillions of CVEs. That's the reason optionality is prevalent in modern languages and type checkers (Python, TypeScript); nowadays even Java has sane non-nullable types.
    kelnos2 hours ago
    > to represent pointers at no cost
    I wouldn't call "cause of bugs and security issues" "no cost".
    > it's quite rare for a pointer in C to have the ability to be NULL
    As a C programmer for more than 25 years, that is the exact opposite of my experience.
    XorNot9 minutes ago
    The problem with let's get rid of NULL is that it's a real, required state. The vast majority of computing is actually not binary: any real input generally has at least 3 possible states: not set, true and false.
    In practice really 4 because "indeterminate" is a reasonable error condition you'd like to know about.
    And it keeps increasing anyway: e.g. not set has subcategories: not set due to lack of user input, not set because we're loading state from the backend etc.
    NULL is the first expression of that basic problem: it's definitely not enough to eliminate NULL, because the first thing that happens is your non-pointer default value takes its place.
    UqWBcuFx6NV4r40 minutes ago
    This precise mindset is why the world has suffered for decades (wrt security/integrity/availability) at the hands of what can only be described as an industry led by completely unjustified male confidence. Why are there still people fighting the “it’s not that bad, guys! you’ve just got to be a good developer like ME!” fight?
    IgorPartola5 minutes ago
    Is None OK in Python?
    NULL in C just doesn’t belong at the end of a string. But IMO having a “there is no value here” designation is not a bad thing.
- jackbucks3 hours ago
  It was definitely an interesting way to allocate pointers. I did once have a very large project where devs didn't understand this, and I resolved hundreds or more off-by-one errors and memory overwrites in C due to this feature.
  But at the same time, I think blaming the software was kind of a cop-out. Devs were in a hurry and simply didn't respect the rules. Given today's software engineers at large, nerfing programming languages so they can't destroy things might not be a bad idea. But AI will nerf everything.
  - fragmede2 hours ago
    why is AI gonna nerf everything? sure it could be used as the easy button, but I just spent two hours this morning learning about the neuroscience of how memory works in the brain that I didn't mean to and now I want to run studies on how memory works.
    Why do you assume that AI is gonna nerf everything?
    AnimalMuppet15 minutes ago
    AGI might. AI? No way.
    See, AI was trained on existing data - on all that existing C code out there (sure, and also on all the papers and articles saying what was wrong with that C code). Those bugs are in the training data, and often not marked as bugs. So when AI generates C code, is it going to avoid making the mistakes that human code made? No, it's going to generate the kind of code it was trained on. How could it be otherwise?
    That's not going to nerf anything.
- bsder32 minutes ago
  Zero terminated string is a special case of sentinel value termination.
  And sentinel value terminations make a lot of sense when you have punch cards and fixed length records that you need to carve into pieces.
  Nobody expected any decisions they were making in the 1960s and 1970s to have any bearing on computing a half-century later. They all expected to have their mistakes long papered over by smarter people at some point.
  But we ALL make the mistake of underestimating inertia.
- fragmede10 minutes ago
  Compared to von Neumann versus Harvard architecture for LLMs? I think that's a far bigger mistake.
- msla2 hours ago
  In addition to having to pick a size for the length counter and then, later, having to differentiate between lengths in bytes, codepoints, and glyphs, you can't subdivide a Pascal string using pointer arithmetic. To pass just the end of a string into a function, you have to either copy the tail of one Pascal-style string to another with a smaller size value, or your string has to be a struct with an integer and a pointer to the actual data instead of just an integer stuck on the beginning of the string. The first is a lot of copying in some cases, the second raises the specter of structs with invalid pointers. That's not to mention the potential problems that would cause with caches.
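  To make the trade-off concrete, here is a small sketch of the "struct with an integer and a pointer" option next to plain pointer arithmetic on a C string. `str_view` and `view_tail` are hypothetical names, not an existing API.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <string.h>

  /* Hypothetical "pointer + length" view; it does not own the bytes. */
  struct str_view {
      const char *ptr;
      size_t len;
  };

  /* Tail of a view starting at offset `from`; clamps instead of overrunning. */
  static struct str_view view_tail(struct str_view v, size_t from) {
      if (from > v.len) from = v.len;
      struct str_view t = { v.ptr + from, v.len - from };
      return t;
  }

  int main(void) {
      const char *c_str = "hello world";

      /* C string: the tail is just a pointer into the same bytes. */
      const char *c_tail = c_str + 6;
      assert(strcmp(c_tail, "world") == 0);

      /* View: the tail is a new (pointer, length) pair, still no copy. */
      struct str_view whole = { c_str, strlen(c_str) };
      struct str_view tail = view_tail(whole, 6);
      assert(tail.len == 5 && memcmp(tail.ptr, "world", 5) == 0);
      assert(tail.ptr == c_tail);

      printf("%.*s\n", (int)tail.len, tail.ptr);
      return 0;
  }
  ```

  The view avoids the copy, but as noted above it is only as valid as the buffer it points into.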
  - cornholio8 minutes ago
    You can have a universal variable length field, for example 2 bytes for strings < 32768, then four bytes, 8 bytes etc. On the critical short string path, it costs just a single bit test. The glyph vs byte issues need to be dealt with in both formats.
    The subdivision issue is a good perspective, but I would argue the performance impact of cloning substrings is dwarfed by the redundant full-string reads needed to find the length.
  - estebank31 minutes ago
    The third option is to have a variable width length: the top most bit signals whether the next byte corresponds to the length or to the start of the string.
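    One way to read that third option is a varint-style prefix: 7 bits of length per byte, with the top bit set when another length byte follows. This is only a sketch of that interpretation; `put_len` and `get_len` are hypothetical helpers, and the low-group-first byte order is my assumption.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Writes `len` as 7-bit groups, low group first; the top bit of each byte
       is set when another length byte follows. Returns bytes written (<= 10). */
    static size_t put_len(uint8_t *out, uint64_t len) {
        size_t n = 0;
        do {
            uint8_t b = (uint8_t)(len & 0x7F);
            len >>= 7;
            if (len) b |= 0x80;
            out[n++] = b;
        } while (len);
        return n;
    }

    /* Reads a length written by put_len into *len. Returns the number of
       prefix bytes consumed, or 0 if the prefix is truncated or too long. */
    static size_t get_len(const uint8_t *in, size_t avail, uint64_t *len) {
        uint64_t v = 0;
        for (size_t i = 0; i < avail && i < 10; i++) {
            v |= (uint64_t)(in[i] & 0x7F) << (7 * i);
            if (!(in[i] & 0x80)) {
                *len = v;
                return i + 1;
            }
        }
        return 0;
    }

    int main(void) {
        uint8_t buf[10];
        uint64_t len = 0;

        size_t short_n = put_len(buf, 5);   /* fits in one byte */
        assert(short_n == 1 && buf[0] == 0x05);

        size_t long_n = put_len(buf, 300);  /* needs a second byte */
        assert(long_n == 2 && buf[0] == 0xAC && buf[1] == 0x02);

        size_t used = get_len(buf, sizeof buf, &len);
        assert(used == 2 && len == 300);

        printf("length 300 uses %zu prefix bytes\n", long_n);
        return 0;
    }
    ```

    Short strings pay one byte of overhead, the same as a NUL terminator, and the string data starts right after the prefix.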
- themafia2 hours ago
  > Pascal style strings were much safer.
  The limitations were brutal. Initially you could only have 255 bytes in a string. The length of a string and the size of the allocation are now separate and you may need to think about that unused memory in your design. The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.
  If you want to create an array of strings you either need to specify the length of all strings and accept the memory overhead or have an array of pointers to strings. If you use an array of pointers you may end up choosing to use the 'nil' value as a sentinel that means "end of list." So we're right back where we started.
  --
  Because someone decided to downvote this HN has limited the speed at which I can reply. This site is tragic and I'm fully done with it now. You can spread propaganda and poorly sourced zeitgeist and be among friends but if you try to have a genuine conversation about programming languages you are made to be unwelcome immediately. Screw this.
  --
  > No other data structure works like this.
  The linked list.
  > You can't mess this up in an array
  C happily decomposes arrays into pointers. You can erase your length information from the type. This was an intentional decision.
  > Strings are the only data structure that assume there will be a NULL at end.
  Which is why almost every string API has a version that allows you to specify the maximum length. The fact that you can use a NUL doesn't mean you have to. Which is why the concept of "sentinel values" is broadly used in many types of applications you haven't considered here.
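  As an illustration of the bounded style, here is a sketch that never reads past the buffer even when the terminator is missing. `bounded_len` is a hypothetical helper built on the standard `memchr`; the unterminated buffer is produced with `strncpy`, which writes no NUL when the source is at least as long as the destination.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <string.h>

  /* Length of the string in buf, looking at no more than cap bytes.
     Returns cap when no terminator is found inside the buffer. */
  static size_t bounded_len(const char *buf, size_t cap) {
      const char *nul = memchr(buf, '\0', cap);
      return nul ? (size_t)(nul - buf) : cap;
  }

  int main(void) {
      char small[4];
      /* The source is longer than the destination, so all 4 bytes are
         filled and no terminator is written; strlen(small) would be
         undefined behavior here. */
      strncpy(small, "abcdefgh", sizeof small);
      size_t small_len = bounded_len(small, sizeof small);
      assert(small_len == 4);

      char roomy[8];
      /* Short source: strncpy pads the rest of the buffer with NULs. */
      strncpy(roomy, "abc", sizeof roomy);
      size_t roomy_len = bounded_len(roomy, sizeof roomy);
      assert(roomy_len == 3);

      printf("small: %zu bytes, roomy: %zu bytes\n", small_len, roomy_len);
      return 0;
  }
  ```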
  - BigTTYGothGF11 minutes ago
    > Your string size is in bytes and you need to track characters separately
    No worse than C strings then.
  - AlienRobot2 hours ago
    >The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.
    That isn't really a problem.
    The problem with null-terminated strings is specifically what happens when you reach the end of the allocated array and there ISN'T a NULL character.
    Every string function is designed to keep going until it finds the NULL character, so if a hacker gets rid of the NULL character, he can exploit pretty much any standard string manipulation function being used elsewhere in the program to manipulate whatever memory comes AFTER the string data structure.
    No other data structure works like this. You can't mess this up in an array, because no function that manipulates arrays is just going to keep going until there is a null. That would be stupid because it would require users of the function to add a NULL to the end of their arrays before passing it to the function, so instead we just pass the size of the array to everything. Strings are the only data structure that assume there will be a NULL at end.
    By the way, I read once that if you use UTF-32 every code point will be 4 bytes, constantly, but even then a single code point isn't necessarily a single character. Text is just complicated.
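    A small sketch of the bytes vs. code points vs. characters distinction in UTF-8. `utf8_code_points` is a hypothetical helper and assumes the input is valid UTF-8.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Counts code points in a NUL-terminated UTF-8 string by skipping
       continuation bytes (10xxxxxx). Assumes the input is valid UTF-8. */
    static size_t utf8_code_points(const char *s) {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80) n++;
        return n;
    }

    int main(void) {
        /* 'e' followed by U+0301 COMBINING ACUTE ACCENT (bytes CC 81):
           a reader sees one character. */
        const char *text = "e\xCC\x81";

        size_t bytes = strlen(text);
        size_t points = utf8_code_points(text);
        assert(bytes == 3);
        assert(points == 2);

        printf("%zu bytes, %zu code points, 1 visible character\n", bytes, points);
        return 0;
    }
    ```

    Counting visible characters would need grapheme cluster rules from Unicode, which this sketch does not attempt.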
    tredre3 an hour ago
    > No other data structure works like this.
    In C most data structures work like this, you keep going until you find NUL (character) or NULL (pointer). E.g. Strings, array of pointers, linked lists, etc. Of course you can add length to most of those, but it isn't the canonical/traditional way of doing things.
    AlienRobot11 minutes ago
    That can't be true. If you have an array of pointers it can be terminated in NULL. But an array of integers can't have a NULL value, since NULL would probably be just 0 which is a normal integer.
    The null in a linked list is the null in the .next field, right? That's the way you would implement linked lists independent of language. It's not the .value that is null.
    A string is an array of characters (well, for characters representable in one byte at least) that has a specific value to represent the end of string.
    It would be like if Int::MAX was reduced by 1 to make space for an Int::NUL constant that represented the end of an integer array. Or if you were creating your own ENUM, let's say for NORTH, SOUTH, EAST, WEST, and you added a fifth enumeration called Direction.NUL for use in arrays.
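    A sketch of the two conventions side by side: a NULL-terminated array of pointers, and an integer array that has to carry its length because 0 is an ordinary value. `count_ptrs` and `sum_ints` are hypothetical names.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Sentinel style: walk until the NULL pointer that ends the list. */
    static size_t count_ptrs(const char *const *list) {
        size_t n = 0;
        while (list[n] != NULL) n++;
        return n;
    }

    /* Length style: 0 is a valid element, so the count is passed in. */
    static long sum_ints(const int *values, size_t count) {
        long total = 0;
        for (size_t i = 0; i < count; i++) total += values[i];
        return total;
    }

    int main(void) {
        const char *const names[] = { "north", "south", "east", "west", NULL };
        size_t n = count_ptrs(names);
        assert(n == 4);

        const int values[] = { 3, 0, -2, 0 };
        long total = sum_ints(values, sizeof values / sizeof values[0]);
        assert(total == 1);

        printf("%zu names, sum %ld\n", n, total);
        return 0;
    }
    ```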
PlunderBunny3 hours ago
I worked on a Win32 app that used space-padded strings, i.e. the destination string was padded with spaces, but there was still a null on the last byte. You had to use special versions of the string functions for length, copy etc.
I’m not sure why this was - the source base was so old it might have had its origins in Pascal struct behaviour.
- jkfkfkj2 hours ago
  It can perhaps be due to the string originating from a SQL database ”char” field, i.e. not ”varchar”. Char fields in databases are space padded.
- bebe839392 hours ago
  Perhaps to prevent reallocation when the string size changes? Or to align with CPU cache lines?
- egorfine2 hours ago
  I think this behavior has its roots in COBOL, not Pascal.
  - kps38 minutes ago
    Which has its roots in punch cards, where pre-computer hardware operated on fixed-sized fields and an unpunched column is equivalent to a space.
naturalmovement2 hours ago
A reminder that we've had strlcpy[1] for ~ 30 years but it was never accepted into the Linux world because of typical petty open source bullshit. This is why we can't have nice things.
[1] https://man.openbsd.org/strlcpy
- ericbarrett29 minutes ago
  The Linux kernel had strlcpy over 20 years ago. It was removed in favor of strscpy because the latter was judged a better interface. Here's a 2022 article: https://lwn.net/Articles/905777/
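  The interface difference, as I understand it, is mostly in the return value and in how much of the source gets read. The userspace sketch below only mimics the semantics; `sketch_strlcpy` and `sketch_strscpy` are hypothetical stand-ins, and the real kernel strscpy reports truncation as -E2BIG rather than the -1 used here.

  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>
  #include <string.h>

  /* strlcpy-style: returns strlen(src), so it always walks the whole source. */
  static size_t sketch_strlcpy(char *dst, const char *src, size_t size) {
      size_t len = strlen(src);
      if (size > 0) {
          size_t n = len < size - 1 ? len : size - 1;
          memcpy(dst, src, n);
          dst[n] = '\0';
      }
      return len;
  }

  /* strscpy-style: reads at most `size` bytes of src and returns the number
     of bytes copied, or -1 on truncation (stand-in for -E2BIG). */
  static long sketch_strscpy(char *dst, const char *src, size_t size) {
      if (size == 0) return -1;
      size_t i = 0;
      for (; i < size - 1 && src[i] != '\0'; i++) dst[i] = src[i];
      dst[i] = '\0';
      return src[i] == '\0' ? (long)i : -1;
  }

  int main(void) {
      char buf[4];

      size_t wanted = sketch_strlcpy(buf, "abcdef", sizeof buf);
      assert(wanted == 6 && strcmp(buf, "abc") == 0); /* caller compares to size */

      long copied = sketch_strscpy(buf, "abcdef", sizeof buf);
      assert(copied == -1 && strcmp(buf, "abc") == 0); /* truncation is an error */

      long fits = sketch_strscpy(buf, "abc", sizeof buf);
      assert(fits == 3 && strcmp(buf, "abc") == 0);

      printf("strlcpy-style: %zu, strscpy-style: %ld then %ld\n", wanted, copied, fits);
      return 0;
  }
  ```

  Both always terminate the destination when size is nonzero; the strscpy style makes truncation harder to ignore and does not depend on the source being terminated beyond the bytes it copies.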
- BoingBoomTschak an hour ago
  Actually, glibc 2.38 has it.
  - naturalmovement36 minutes ago
    Wow it only took them 26 years to import a 30 line C function, a third of which is comments?
    I should have sent them a nice fruit basket to commemorate the occasion.
larodi3 hours ago
I wonder when someone is going to be brave enough to fork the Linux kernel and try to fast-forward it with automatic programming.
- fragmede2 hours ago
  Why would you start there instead of creating something from scratch? If you can port drivers just as easily, meaning you don't especially give a shit about the hardware you're running on in the first place, why even deal with Linux? The battle-tested LRU cache system?
  - literalAardvark2 hours ago
    It's much easier to use something with all the edge cases already handled as a starting point.
  - convolvatron2 hours ago
    I've seen several workalike kernels in various stages of completion. at least one of them was able to run some pretty substantial applications (Postgres, nginx, that kind of thing), and that is still I guess around 250kloc. but it only really has drivers to support hypervisor devices.
    unfortunately as time goes by, the linux api surface gets larger and more convoluted. so there's going to be some coverage you're just never going to get.
    but in the abstract, definitely. linux is so bloated at this point that it's not clear that it can ever be 'made safe'.