204 points by ingve 4 days ago | 25 comments
  • SleepyMyroslav3 days ago
    In gamedev there is a simple rule: don't try to do any of that.

    If it is text the game needs to show to the user, then every version of the text that is needed is a translated text. The programmer will never know whether context or locale will need word order changes or anything complicated. Just trust the translation team.

    If text is coming from the user - then change the design until it's not needed to 'convert'. There are major issues just to show the user back what they entered! Because the font for editing and displayed text could be different. Not even mentioning RTL and other issues.

    Once ppl learn about localization, questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)

    • fluoridation3 days ago
      >Once ppl learn about localization, questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)

      I think you are purposefully misinterpreting the question. They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.

      What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?

      • SleepyMyroslav3 days ago
        >What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?

        Of course! Your string manipulation with user-entered attributes like display names or chat messages is one millimeter away from good old SQL 'Bobby; DROP TABLE Students'. Never ever do that if you can avoid it. Every time someone 'just concatenates' two strings, e.g. to add a 'symbol that represents an input button', the programmer creates a bad bug that will be both annoying and wrong. Games should use substitution patterns guided by the translation team, because there is no ASCII culture among the ~15 locales typically supported by big publishers.

        There are exceptions, like platform-provided services to filter banned words in chat. And even there you don't have to do 'things with ASCII characters'. Yeah, players will input unsupported symbols everywhere they can, and you need to have good replacement characters for those and fix support for popular emojis regularly. That is expected by communities now.

      • squeaky-clean3 days ago
        > They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.

        I'm confused now. The article specifically mentions issues with UTF-16 and UTF-32 Unicode characters outside the basic multilingual plane (BMP).

        • fluoridation3 days ago
          I'm referring to the people who call case conversion in general "a simple text operation". Say you have an std::string and you want to make it lower case. If you assume it contains just ASCII that's a simpler operation than if you assume it contains UTF-8, but C++ doesn't provide a single function that does either of them. A person can rightly complain that the former is a basic functionality that the language should include; personally, I would agree. And you could say "wow, doesn't this person realize that case conversion in Unicode is actually complicated? They must be really inexperienced." It could be that the other person really doesn't know about Unicode, or it could mean that the two of you are thinking about entirely different problems and you're being judgemental a bit too eagerly.
          • squeaky-clean3 days ago
            For ASCII in C++, isn't there std::tolower / std::toupper? If you're not dealing with unsigned char types there isn't a simple case conversion function, but that's for a good reason, as the article lays out.
            • fluoridation3 days ago
              Those functions take and return single characters. What's missing are functions that operate on strings. You can use them in combination with std::transform(), but as the article points out, even if you're just dealing with ASCII you can easily do it wrong. I've been using C++ for over 20 years and I didn't know tolower() and toupper() were non-addressable. There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.
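
              For the record, the usual correct incantation looks something like this sketch (helper name made up; ASCII-oriented):

                  #include <algorithm>
                  #include <cctype>
                  #include <string>

                  // In-place lowercasing via std::transform. The lambda means we never
                  // take the address of std::tolower, and the unsigned char parameter
                  // avoids UB when plain char is signed and holds a negative value.
                  void to_lower_ascii(std::string& s) {
                      std::transform(s.begin(), s.end(), s.begin(),
                                     [](unsigned char c) { return std::tolower(c); });
                  }
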
              • squeaky-clean3 days ago
                std::transform() seems like overkill when you can just iterate over the string and modify it in place. And in my opinion, transform is way less readable than seeing a loop over some array with a single operation inside.

                The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.
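
                Something like this sketch, say, assuming the bytes really are plain ASCII (helper name invented):

                    #include <string>

                    // In-place lowercase for pure-ASCII content. Bytes >= 0x80 (e.g.
                    // UTF-8 continuation bytes) are left untouched.
                    void lower_inplace(std::string& s) {
                        for (char& c : s) {
                            if (c >= 'A' && c <= 'Z')
                                c += 'a' - 'A';
                        }
                    }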

                If you are operating on wide strings, there is no suitable single solution, partly because wstring is a terrible type. It's different widths on different platforms, and no string encoding format uses a generalized wstring; they all have mandatory min/max character byte widths. So a wstring tells you nothing about the semantic representation of the actual encoded string contents.

                The C++ stdlib could include a fully Unicode-aware string type set, and a surrounding library. But personally I think C++ isn't the kind of language to provide an opinionated stdlib module for such a complex task. And there's no way to implement such a module without being very opinionated about something.

                • dwattttt3 days ago
                  > The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.

                  Since you mention narrow strings in the context of wstring, just to make sure... you can't convert a UTF-8 std::string character by character, in-place (in case that's what you meant).

                  7-bit ASCII code points are fine, but outside that it's not guaranteed that a character occupies the same number of UTF-8 bytes after a case conversion.
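
                  A concrete example (byte counts only, so hex escapes stand in for the characters):

                      #include <cstdio>
                      #include <string>

                      int main() {
                          // U+017F LATIN SMALL LETTER LONG S is two bytes in UTF-8, but its
                          // uppercase mapping is plain "S" (U+0053): one byte.
                          std::string lower = "\xC5\xBF"; // the long s
                          std::string upper = "S";
                          std::printf("%zu -> %zu\n", lower.size(), upper.size()); // 2 -> 1
                      }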

                  • account42 3 days ago
                    It's not guaranteed for 7-bit ASCII input either, because tolower/toupper are locale-dependent, and with tr_TR the lowercase of I (U+0049) is ı (U+0131, aka dotless i), which encodes as two bytes in UTF-8.
                    • squeaky-clean3 days ago
                      That's not ASCII then. It's byte-width compatible (to a certain degree, as you point out), but it's not ASCII. ASCII defines 128 code points and the handling of an escape character. It doesn't handle locales.
                      • account42 3 days ago
                        ASCII is an encoding; it doesn't say anything about locale. The point is that tolower/toupper are not guaranteed to be safe even if the input is 7-bit.
                        • gpderetta2 days ago
                          I don't think there is any possibility of doing locale-specific lower/upper casing in ASCII. It is really designed for (a subset of) American English.
                  • squeaky-clean3 days ago
                    Yeah, if you're using narrow strings for UTF-8 you're making a mistake. wstrings are also not a good representation because of the platform differences, unless you don't care about Windows, in which case it's fine but still not great semantically.

                    In most type definitions you cannot convert UTF-8 via simple iteration because the type generally represents a code point and not a character.

                    You can have a library where UTF-8 characters are a native type and code points are a mostly-hidden internal element. But again, that's highly opinionated for C++.

                    • gpderetta2 days ago
                      I'm not 100% sure what you mean by narrow string, but if you refer to std::string vs std::wstring, then std::string is perfectly fine for encoding UTF-8, as that uses 8-bit code units which are guaranteed to fit in a char. On the other hand, std::wstring would be a bizarre choice for UTF-8 on any platform.
                • fluoridation3 days ago
                  I know I can simply iterate. The point is that it's a function that should be included, not that it's impossible without it. It's one of the most common string operations.
                  • squeaky-clean3 days ago
                    To me that feels like the JS community asking for left-pad or is-even in a module. Why have a dedicated function for 2 lines of code?

                    And it's a huge footgun. There is no ASCII type in C++. People will use the generalized tolower for UTF-8 encoded in narrow strings and have issues.

                    You could say the generalized tolower should support all the different width/encoding combinations and sort it out. But that's still highly opinionated as far as performance is concerned.

                    Generalized string conversion is a very complex problem and you really cannot simplify it in a way that will satisfy most C++ users. Just use ICU or utf8cpp if you want to do string operations and don't care what's going on under the hood. But even then I can't recommend just 1 library, because no perfect 3rd party library exists. A perfect first party library definitely could not exist.

                    • fluoridation3 days ago
                      >Why have a dedicated function for 2 lines of code?

                      Then why does std::max() exist?

                      >People will use the generalized tolower for UTF-8 encoded in narrow strings and have issues.

                      tolower() and toupper() work correctly on UTF-8 strings, because UTF-8 was specifically designed so that non-ASCII characters were represented by sequences of purely non-ASCII bytes.

                      >Generalized string conversion is a very complex

                      Hence why people who say C++ should have a tolower() that operates on strings are not asking for more complex Unicode support.

                • gpderetta3 days ago
                  std::u8string, std::u16string and std::u32string are supposed to be the portable unicode string types, but a lot of machinery is missing and some that has been added has since been deprecated.

                  > there's no way to implement such a module without being very opinionated about something.

                  indeed! Boost.Nowide[1] is such an opinionated library.

                  [1] https://www.boost.org/doc/libs/master/libs/nowide/doc/html/i...

                  • squeaky-clean3 days ago
                    Yep, there's also ICU and utf8cpp, and many others. They all have trade-offs. So I just don't think the stdlib should cover this because there is no objectively best way to handle it.
              • theelous3 3 days ago
                > There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.

                Could not agree more. Any time I touch C I want to scoop my brain out of my ear. So many simple, unbelievably common operations have fifty "best" ways to do them, when they should have the one happy path that 99% of use cases require baked in. Nobody should ever have to seriously consider something as ridiculous as "is tolower addressable?".

            • account42 3 days ago
              std::tolower / std::toupper are rubbish functions that can't do proper Unicode but still pull in the bloated locale machinery for what should be a simple conditional integer addition if all you care about is ASCII. Both have no valid use case and should be marked [[deprecated]] and erased from all teaching materials.
      • lmm3 days ago
        > What if your game needs to talk to a server and do some string manipulation in between requests?

        What conceivable reason would there be to ever need to do that? If the server takes commands in upper case, then have them in upper case from the start. If the server takes commands in lower case, have them in lower case from the start. If the server specifies that you need to invert the case of its response to use in the next request, find a server developed by someone not crazy.

        • fluoridation3 days ago
          Case conversion is not the only string manipulation that's locale sensitive.
          • lmm3 days ago
            No reasonable server API should require locale sensitive string manipulation.
        • NBJack3 days ago
          Word censoring? Ease of use? Console commands (e.g. from Quake to Minecraft)?
          • wongarsu3 days ago
            Those sound exactly like the newcomer detectors GP was referring to. What you want is a case-insensitive string comparison, and outside ASCII that's not equivalent to just turning both strings to lowercase and checking equality (or doing a substring search or whatever the task requires)
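
            For example with ICU (a sketch; full case folding is what makes "straße" and "STRASSE" compare equal, which per-character lowercasing never could):

                #include <unicode/uchar.h>
                #include <unicode/unistr.h>

                // Case-insensitive equality via ICU's full case folding.
                bool equal_ignore_case(const icu::UnicodeString& a,
                                       const icu::UnicodeString& b) {
                    return a.caseCompare(b, U_FOLD_CASE_DEFAULT) == 0;
                }
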
            • account42 3 days ago
              Exactly, and where you want case-insensitive comparison you almost always also want other kinds of Unicode normalization.
          • lmm3 days ago
            > Word censoring?

            Should only ever be needed for text from the user, and in that case, as GP said, find a way to examine it as-is, don't "convert".

            > Ease of use?

            What ease of use? When has futzing around with case ever made anything easier?

            > Console commands (e.g. from Quake to Minecraft)?

            Why would those necessitate changing case?

      • barrkel3 days ago
        Nobody is thinking about converting the case of ASCII characters. To think that is to explicitly exclude most of the world's cultures from entering common names correctly. Restricting thought to ASCII is a lack of thought, not an active thought.
    • zahlman3 days ago
      >If text is coming from the user - then change the design until it's not needed to 'convert'

      In games, you can possibly get away with this. Most other people need to worry about things like string collation (locale-aware sorting) for user-supplied text.

      • makeitdouble3 days ago
        TBF, if you are caring about string collation, you're already at the entrance of the rabbit hole and probably should go down to the deep end anyway.

        I'd assume SleepyMyroslav's rule doesn't apply to devs willing to spend weeks at a time handling all the complexity in full.

    • cheema33 3 days ago
      > In gamedev there is a simple rule: don't try to do any of that.

      I am not in gamedev, but I frequently have to develop middleware that takes in user-entered data and formats it in a way that will import into a 3rd-party system without errors. And that sometimes means changing the case on strings.

      In my experience as a developer, this is a very, very common requirement.

      Luckily I am not forced to use a low level language for any of my work. In C# I can simply do this: "hello world".ToUpper();

      • Smaug123 3 days ago
        If you're putting data into a third-party system, you might want `ToUpperInvariant`, not `ToUpper`. (Just checking that you know the difference, because most people don't!)
      • crote3 days ago
        The problem is that such third-party requirements are usually wrong.

        Two decades ago some developer probably went "Yeah, obviously all names start with capital letters!", not realizing that there are in fact plenty of names which start with a lowercase letter. So they added an input validation test which checks for capitals, which meant everyone feeding that system had to format their data. A whole ecosystem grew around the format of the output of that system, and now you're suddenly rewriting the system and you run into weird and plain wrong capitalization requirements for no technical reason whatsoever.

        Alternatively, the same but start with punch cards which predate ASCII and don't distinguish between uppercase and lowercase letters.

        > In C# I can simply do this: "hello world".ToUpper()

        ... which does not work.

        Take a look at the German word "straße" (street), for example. Until very recently the "ß" character did not have an uppercase variant, so a ToUpper would convert it to "STRASSE". This is a lossy operation, as the reverse isn't true: the lowercase variant of "KONGRESSSTRASSE" (congress street) is not "kongreßstraße" - it's supposed to be "Kongressstraße".

        It can get even worse: the phrase "in Maßen" (in moderate amounts) naively has the uppercase variant "IN MASSEN" - but that means "in huge amounts"! In that case it is probably better to stick to "IN MASZEN".

        And then there's Turkish, where the uppercase variant of the letter "i" is of course "İ" rather than "I" - note the dot.

        So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough on basic ASCII for languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible in general-purpose text fields.

        In practice that means your options are a) giving up on the concept of uppercase/lowercase conversion and just passing it as-is, or b) accepting that you are inevitably going to be silently corrupting your data.

        • neonsunset2 days ago
          > So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough on basic ASCII for languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible in general-purpose text fields.

          Have you ever read the documentation? https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...

          • crote2 days ago
            Yes. Now try applying it to something like this very HN comment section, which is mixing words belonging to different cultures inside a single comment - and in some cases even inside the same word.

            Sure, you can now do case conversion for a specific culture, but which one?

        • wruza3 days ago
          It’s a lossy operation, but it does work. By this logic jpeg and mpeg don’t work either. But were watching them videos daily.

          Yes we can simply ToUpper(). We just can’t ToUpper().ToLower(), but that’s useless cause we have the original string if we need it and fine if we don’t need it.

          • account42 3 days ago
            The point is that what ToUpper does depends on the locale AND the Unicode version. Thus for many applications it only appears to work, until it fails spectacularly in production.
      • Netch3 days ago
        > In C# I can simply do this: "hello world".ToUpper();

        Hmm, still relevant: https://www.moserware.com/2008/02/does-your-code-pass-turkey...

        • neonsunset2 days ago
          > 2008

          This is completely irrelevant because culture-sensitive case conversion relies on ICU/NLS.

          • Netch a day ago
            But at least the programmer needs to be aware enough to call it (whatever API is used).
      • pjmlp3 days ago
        Note that the correct way to do that in C# would be to pass an instance of CultureInfo.
    • jameshart3 days ago
      I don’t think you can say this is universally known in ‘game dev’. In fact just last week I stumbled using the UI in a game that let me enter a name for something, which it then displayed in uppercase.

      Game UI is the place I’d expect to most likely come across horrific abuses of localization precisely because game UI is such a cobbled together layer of hacks on hacks.

    • beeboobaa3 3 days ago
      > There are major issues just to show the user back what they entered! Because the font for editing and displayed text could be different. Not even mentioning RTL and other issues.

      Your web browser is doing it right now as you are reading this comment.

      • rty32 3 days ago
        And web development is not game development? And chances are that games don't ship Chromium with them?
        • moron4hire3 days ago
          Actually...

            https://github.com/baikety/uWebKit
            https://zenfulcrum.com/browser/docs/Readme.html
            https://github.com/roydejong/chromium-unity-server
          
          There are a lot more, I just got bored at 3.

          And it's not just Unity. Several exist for Unreal as well.

          Why? Specifically because 2D layout and text rendering suck so much in game engines. What's ~50MB matter when you're shipping several GB of game assets?

  • blenderob4 days ago
    It is issues like this that made me give up on C++. There are so many ways to do something and every way is freaking wrong!

    An acceptable solution is given at the end of the article:

    > If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

    Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. I do need more standard library functions that solve these ordinary real-world programming problems.
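
    For reference, calling it looks something like this sketch (wrapper name made up; error handling trimmed):

        #include <unicode/ustring.h>

        // Locale-aware uppercasing of UTF-16 text with ICU's C API. Real code
        // would call u_strToUpper twice: once to measure, once to convert.
        int32_t to_upper_icu(UChar* dest, int32_t destCapacity,
                             const UChar* src, int32_t srcLength,
                             const char* locale) {
            UErrorCode status = U_ZERO_ERROR;
            int32_t n = u_strToUpper(dest, destCapacity, src, srcLength,
                                     locale, &status);
            return U_FAILURE(status) ? -1 : n;
        }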

    • bayindirh4 days ago
      I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.

      On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.

      Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is of mammoth size now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of the "new tricks", so more features are added on top of its already impressive and very long list of features and capabilities.

      You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.

      So, C++ is doing fine. It's not that they omitted Unicode during the design phase. Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.

      • zahlman3 days ago
        >You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.

        Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....

        But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)
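
        (That last point, in two lines:)

            #include <type_traits>

            // char is a distinct type from both of its same-sized siblings:
            static_assert(!std::is_same_v<char, signed char>);
            static_assert(!std::is_same_v<char, unsigned char>);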

      • pornel4 days ago
        Being developed in, and having to stay compatible with, ancient times is a real problem of C++.

        The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.

        Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.

        • cm2187 4 days ago
          That explains why there are two functions, one for ASCII and one for Unicode. That doesn't explain why the Unicode functions are hard to use (per the article).
          • BoringTimesGang4 days ago
            Because human language is hard to boil down to a simple computing model and the problem is underdefined, based on naive assumptions.

            Or perhaps I should say naïve.

            • cm2187 3 days ago
              Well, pretty much every more recent language has solved that problem.
              • kccqzy3 days ago
                Almost no programming language, perhaps other than Swift, solved that problem. Just use the article's examples as test cases. It's just as wrong as the C++ version in the article, except it's wrong with nicer syntax.
                • zahlman3 days ago
                  Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. They don't use UTF-16 internally (they can use UCS-2 for strings whose code points will fit in that range; while a string might store code points from the surrogate-pair range, they're never interpreted as surrogate pairs, but instead as an error encoding so that e.g. invalid UTF-8 can be round-tripped) so they're never worried about surrogate pairs, and it knows a few things about localized text casing:

                      >>> 'ß'.upper()
                      'SS'
                      >>> 'ß'.lower()
                      'ß'
                      >>> 'ß'.casefold()
                      'ss'
                  
                  There are a lot of really complicated tasks for Unicode strings. String casing isn't really one of them.

                  (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)

                  • crote3 days ago
                    But that's wrong. The uppercase for "in Maßen" ("in moderate amounts") is not "IN MASSEN" ("in Massen", meaning "in massive amounts").
                  • kccqzy3 days ago
                    Still breaks on, for example, Turkish i vs İ. It's impossible to do correctly without language information.

                    > (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)

                    Yes that's my point. Because in typical languages strings don't store language metadata, this is impossible to do correctly in general.

                  • tedunangst3 days ago
                    But that's wrong. The upper case for ß is ẞ.
                    • cm2187 3 days ago
                      C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.
                      • account42 3 days ago
                        This is not a locale issue, it's a Unicode version issue. Which highlights another problem with adding this to the base standard library.
                    • IncreasePosts3 days ago
                      That was only adopted in Germany like 7 years ago!
                      • kccqzy3 days ago
                        Well languages and conventions change. The € sign was added not that long ago and it was somewhat painful. The Chinese language uses a single character to refer to chemical elements so when IUPAC names new elements they will invent new characters. Etc.
                        • extraduder_ire3 days ago
                          Does unicode have space set aside for those new symbols to slot into? I know it's very rare, but it could get messy.
                          • account42 3 days ago
                            Unicode is already messy. Chinese characters especially so, due to Han unification.
                    • Towaway69 3 days ago
                      Isn't uppercase for ß just ß - i.e. it's its own uppercase character?
                      • bratwurst3000 3 days ago
                        There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter; the German language didn't think of allcaps. Please correct me if I am wrong. If written in uppercase it should be converted to SZ or the new uppercase ß... which my iPhone doesn't have... and converting anything to uppercase SS isn't something Germany wants...
                        • account42 3 days ago
                          > There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter; the German language didn't think of allcaps.

                          Allcaps (and smallcaps) have always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and just whatever you could draw before that. Historically, language was not as standardized as it is today.

                        • Towaway69 2 days ago
                          I don't think that Germany wants a capital ß or that the German language requires one; rather, technology needs one to dot the i's and cross the t's.
                      • account42 3 days ago
                        Not generally, no, but some applications used it that way because of the ambiguity of uppercasing ß to SS - which is why ẞ was added.
                        • Towaway69 2 days ago
                          On the other hand, the German language has existed for several hundred years without having a capital ß but now it needs one?

                          True capitalisation has always existed but even that didn’t seem to have required a capital ß - why now?

                • tialaramex3 days ago
                  Rust will cheerfully:

                      assert_eq!("ὀδυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
                  
                  [Notice that this is in fact entirely impossible with the naive strategy since Greek cares about position of symbols]

                  Some of the latter examples aren't cases where a programming language or library should just "do the right thing" but cases of ambiguity where you need locale information to decide what's appropriate, which isn't "just as wrong as the C++ version"; it's a whole other problem. It isn't wrong to capitalise A-acute as a capital A-acute, it's just not always appropriate depending on the locale.

                  • account42 3 days ago
                    Is this

                        assert_eq!("\u{1F41}δυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
                    
                    or

                        assert_eq!("\u{03BF}\u{0314}δυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
                    
                    For display it doesn't matter, but most other applications really want some kind of normalization, which does much much more, so having a convenient to_lowercase() doesn't buy you as much as you think and can be actively misleading.
            • MBCook3 days ago
              So what?

              That doesn’t prevent adding a new function that converts an entire string to upper or lowercase in a Unicode aware way.

              What would be wrong with adding new correct functions to the standard library to make this easy? There are already namespaces in C++ so you don’t even have to worry about collisions.

              That’s the problem I see. It’s fine if you have a history of stuff that’s not that great in hindsight. But what’s wrong with having a better standard library going forward?

              It’s not like this is an esoteric thing.

              • wakawaka28 3 days ago
                The reason that wasn't done is that Unicode is not really in older C++ standards. I think it may have been added in C++23 but I am not familiar with that. There are many partial solutions in older C++, but if you want to do it well then you need to get a library for it from somewhere, or else (possibly) wait for a new standard.

                Unicode and character encodings are pretty esoteric. So are fonts. The stuff is technically everywhere and fundamental, but there are many encodings, technical details, etc. And most programmers only care about one language, or else only use UTF-8 with the most basic chars (the ones that agree with ASCII). That isn't terrible. You only need what you actually need. Most programs don't strictly have to be built for multiple random languages, and there is kind of a standard methodology to learn before you can do that.

        • account42 3 days ago
          No, strong backwards compatibility is a real strength of C++. In fact, it's probably its main strength these days.
      • relaxing4 days ago
        It’s been 30 years. Unicode predates C++98. Java saw the writing on the wall. There’s no excuse.
        • bayindirh4 days ago
          > There’s no excuse.

          I politely disagree. None of the programming languages which started integrating Unicode were targeting everything from bare metal to GUI, incl. embedded and OS development, at the same time.

          C++ has a great target area when compared to other programming languages. There are widely used libraries which compile correctly on PDP-11s, even if they are updated constantly.

          You can't just say "I'll be just making everything Unicode aware, backwards compatibility be damned, eh".

          • blenderob4 days ago
            But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.

            But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.

            • bayindirh4 days ago
              > Converting one Unicode string to another is a purely in-memory, in-CPU operation.

              ...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert a string to another like you would with the ASCII table or any other simple encoding.

              Germans have their ß to S (or capital ß depending on the year), Turkish has the ı/I and i/İ pairs, and tons of other languages have other rules.

              Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I reported, and how many workarounds I have implemented in my systems.

              Adding a type is nice, but the surrounding machinery is so big, it brings tons of work with it. Unicode is such a complicated system that, as I read, you even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).

              • SAI_Peregrinus4 days ago
                > Unicode is such a complicated system that, as I read, you even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).

                Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
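
                The surrogate-pair mechanics, as a worked example:

                    #include <cstdio>

                    int main() {
                        // UTF-16 encodes code points above U+FFFF as two code units.
                        unsigned cp = 0x1F600;              // U+1F600 GRINNING FACE
                        unsigned v  = cp - 0x10000;         // the 20 bits above the BMP
                        unsigned hi = 0xD800 + (v >> 10);   // high surrogate: 0xD83D
                        unsigned lo = 0xDC00 + (v & 0x3FF); // low surrogate:  0xDE00
                        std::printf("U+%X -> 0x%X 0x%X\n", cp, hi, lo);
                    }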

                • numpad0 3 days ago
                  It's because Unicode doesn't allow for language switching.

                  It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (and I don't think there's any font that actually supports this).

                  AFAICS (as far as I can search), the Simplified (PRC) and Traditional (Taiwan) Chinese encodings are respectively called GB2312 and Big5, and they're both two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed to be used as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, and actually improve language support.

                • account42 3 days ago
                  The number of characters is not the problem; the mess due to legacy compatibility is - case folding and normalization could be much simpler if the codepoints were laid out with that in mind. Also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification), semantic characters (e.g. Cyrillic vs. Latin letters), or just "ideas" (emojis).
                • bayindirh4 days ago
                  I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know Emoji chaining for skin color, etc.).

                  So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)

                  • fluoridation3 days ago
                    Cuneiform codepoints are 17 bits long. If you're using UTF-16 you'll need two code units to represent a character.
                    • gpderetta3 days ago
                      You also need two UTF-16 code units for plain emojis.
                  • TorKlingberg3 days ago
                    Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
              • account42 3 days ago
                > Germans have their ß to S (or capital ß depending on the year)

                FYI, it's never S. If there is no better option then SS and ss are the proper capital and lowercase substitutions.

              • blenderob4 days ago
                Thanks for the reply! Really appreciate the time you have taken to write down a thoughtful reply.
        • wakawaka28 3 days ago
          Java was built from scratch as a heavy language with a whole portability layer that C++ does not have. Also, libraries have been around to do this stuff in C++, but presumably some people thought it better not to require C++ to support Unicode.
        • Netch3 days ago
          > There’s no excuse.

          Until the mid-2000s there was no certainty that Unicode would eventually defeat its competitors. In reality it hasn't fully done so yet - GB2312 and TRON are still locally prevailing, and IBM still jogs along with EBCDIC. But in those early times nobody was reasonably sure, and the Java attempt could have failed as well. (Moreover, the Java approach of UCS-2 was wrong - already commented on nearby.)

        • nitwit005 2 days ago
          Java embraced Unicode, and ended up with a mess as Unicode changed underneath it.

          You can actually end up in a cleaner state in C++, as there is no obligation to use the standard library string classes, but it's pretty much required in Java.

        • gpderetta3 days ago
          Java ended up picking UCS-2 and getting screwed.
          • throwaway2037 3 days ago
            Pretty much all Unicode early adopters went for 16-bit chars. Qt and Win32 API are another pair.
            • gpderetta3 days ago
              Indeed, ICU as well, and then they all moved to UTF-16, which, again, in the long term lost to UTF-8. My point is that committing to a specific Unicode design 30 years ago was not, in retrospect, necessarily a good idea.

              By not committing to UCS-2 early, C++ left the road open to UTF-8. I'll concede that UTF-8 has been the clear winner for more than a decade and C++ is well past the point where it should have at least basic built-in support. The problem is that there is at least one important C++ platform that only very recently added full support for the encoding in its native API.

            • account42 3 days ago
              Qt really has no excuse for still using 16-bit characters since unlike the other two they have had multiple ABI breaks since then.
              • throwaway2037 2 days ago
                "no excuse" -- I would respectfully disagree here. There are lots of very smart people who have worked on Qt. Really, some insanely good C++ programmers have worked on that project. I have no doubt that they have discussed changing class QString to use UTF-8 internally. To be clear, probably QChar would also need to change, or a new class (QChar8?) would be needed, in parallel to QChar. I guess they concluded the API breakage would be too severe. I assume Java and Win32/DotNet decided the same. Finally, you can Google for old mailing list discussions about QString using UTF-16. Many before have asked "can we just change to UTF-8?".
                • account42 2 days ago
                  Ah yes, appeal to authority. No better way to admit that you are talking out of your arse.
        • account42 3 days ago
          Java has 16-bit character types. It is in no way better at modern Unicode than C++ while being needlessly less efficient for mostly-ASCII text like XML-like markup.
      • ectospheno3 days ago
        > Any tool which is old enough will have a thousand ways to do something.

        Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.

        Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.

        • fluoridation3 days ago
          So how do you design a language that accommodates both the people who need a codebase to be stable for decades and the people who want the bleeding edge all the time, backwards compatibility be damned?
          • the_gorilla3 days ago
            You don't. Any language that tries to do both turns into an unusable abomination like C++. Good languages are stable and the bleeding edge is just the "new thing" and not necessarily better than the old thing.
            • fluoridation3 days ago
              C++ doesn't try to do that. It aims to remain as backwards compatible as possible, which is what the GP is complaining about.
      • akira2501 3 days ago
        > libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.

        Isn't that mostly just from tables derived from the Unicode standard?

    • pistoleer4 days ago
      > There are so many ways to do something and every way is freaking wrong!

      That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in JavaScript or Python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.

      • pornel4 days ago
        JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).

        Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.

    • Muromec4 days ago
      Well, the only time you can do str lower where Unicode locale awareness will be a problem is when you do it on user input, like names.

      How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.

      • account42 3 days ago
        Yes, except when it is not your choice. If the requirements are to display some strings in lower/uppercase then you need to find a way to do that. That doesn't have to be using the standard library though.
    • pjmlp4 days ago
      Because it is a fight to put anything into an ISO-managed language, and only the strongest persevere long enough to make it happen.

      Regardless of what ISO language we are talking about.

      • account42 3 days ago
        If anything, it should be harder to add things to the language. Too many new additions have been half-arsed and needed to be changed or deprecated soon after.
      • gpderetta3 days ago
        Yes, significantly smaller libraries have had a hard time getting into the standard. Getting the equivalent of ICU in would be almost impossible. And good luck keeping it up to date.
    • account42 3 days ago
      > Makes you wonder why this isn't part of the C++ standard library itself.

      Because the C++ standard library cares about binary size and backwards compatibility, both of which are incompatible with a full Unicode implementation. Putting this in the stdlib means everyone has to pay for it even when they don't need it.

      Libraries are fine, not everything needs to be defined by the language itself.

    • Netch3 days ago
      > Makes you wonder why this isn't part of the C++ standard library itself.

      Plainly there is no need if there is a separate, easily attachable library (and with a permissive license). What C++ had to do - provide character (char{8,16,32}_t) and string types - it has done.

    • Night_Thastus3 days ago
      As a C++ dev, I have never run into the problem the post is describing. Upper and lowercase conversion has always worked just fine. Though then again, I don't fiddle with mixed unicode and non-unicode situations.
    • wslh3 days ago
      Me too. How is case conversion done perfectly in modern languages such as Zig [1], Rust, or Swift?

      [1] Ended up looking at https://github.com/JakubSzark/zig-string

      • steveklabnik3 days ago
        In Rust, the APIs are clear if they're ASCII only or unicode aware.

        https://doc.rust-lang.org/stable/std/primitive.str.html#meth...

        > ‘Lowercase’ is defined according to the terms of the Unicode Derived Core Property Lowercase.

        https://doc.rust-lang.org/stable/std/primitive.str.html#meth...

        > ASCII letters ‘A’ to ‘Z’ are mapped to ‘a’ to ‘z’, but non-ASCII letters are unchanged.

        Now, "perfectly" is very strong. For example, the Turkish i problem. That is not solved. But 99% of Unicode stuff is handled correctly by default.

        • oguz-ismail3 days ago
          > 99% of Unicode stuff

          Does that include context-dependent conversion rules like o'reilly -> O'Reilly?

          • PaulDavisThe1st 3 days ago
            That is neither up-casing nor down-casing, but (de)capitalization, which is a significantly more complex task (one which ultimately requires up- or down-casing, but a whole lot more before then).
            • oguz-ismail3 days ago
              So it doesn't. If Unicode doesn't cover non-trivial forms of case-folding, 99% of Unicode doesn't mean anything.
              • PaulDavisThe1st 2 days ago
                I am not aware of a Unicode concept of "the Latin letter o followed by an apostrophe followed by another Latin letter". Unicode would identify the glyphs for such a concept, but I don't see how Unicode is involved in any way in the process of deciding what "capitalized o'reilly" means.
          • steveklabnik3 days ago
            Sort of, see the Greek example elsewhere in this thread. I don’t think that specific situation is part of Unicode though.
    • hoseja3 days ago
      >Makes you wonder why this isn't part of the C++ standard library itself.

      Because then every change in Unicode would need to be standardized in C++ as well. Yup. Can't have Unicode due to committee friction.

    • dennis_jeeves2 3 days ago
      > There are so many ways to do something and every way is freaking wrong!

      Stroustrup, laugheth!

    • BoringTimesGang4 days ago
      >It is issues like this that made me give up on C++. There are so many ways to do something and every way is freaking wrong!

      These are mostly Unicode or linguistics problems.

      • tralarpa4 days ago
        The fact that the standard library works against you doesn't help (tolower takes an int, but only works correctly for values that fit in an unsigned char, and wchar_t is implicitly promoted to int).
        • BoringTimesGang4 days ago
          tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefix headers to use it, already hinting to them that 'here be dragons'.

          But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.

          • account42 3 days ago
            > Is the alternative that it should be made unusable, and more existing code broken?

            It should be marked [[deprecated]], yes. There is no good reason to use std::tolower/toupper anywhere - they can neither do Unicode properly nor are they anywhere close to efficient for ASCII. And their behavior depends on the process-global locale.

  • appointment4 days ago
    The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB or es-MX and es-ES.

    If you are handling multilingual text the locale is mandatory metadata.

    • zarzavat4 days ago
      Different parts of a string can be in different languages too[1].

      The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss about fußball". Unless you're in Switzerland.

      [1] https://en.wikipedia.org/wiki/Code-switching

      • schoen4 days ago
        Probably "don't fuss about Fußball" for the same reasons, right?
      • thiht4 days ago
        I thought the German language deprecated the use of ß years ago, no? I learned German for a year and that's what the teacher told us, but maybe it's not the whole story
        • 47282847 4 days ago
          Incorrect. ẞ is still a thing.
          • CamperBob2 3 days ago
            Going by what you and the grandparent wrote, it's not just a thing, but two different things: ẞ ß

            It is probably time for an Esperanto advocate to show up and set us all straight.

            • selenography3 days ago
              > set us all straight.

              Se fareblus oni, jam farintus oni. (If it could be done, it would have been done already.) (It definitely won't happen on an echo-change day like today, either. ;))

              Contra my comrade's comment, Esperanto orthography is firmly European, and so retains European-style casing distinctions; every sound thus still has two letters -- or at least two codepoints.

              (There aren't any eszettesque digraphs, but that's not saying much.)

            • D-Coder 3 days ago
              Pri kio vi parolas? En Esperanto, unu letero egalas unu sonon.

              What are you talking about? In Esperanto, one letter equals one sound.

          • TZubiri3 days ago
            Germans run Uber Long Term Support dialects
        • Kwpolska3 days ago
          The Swiss have dropped ß, but it's still a thing in Germany or Austria.
    • jeroenhd3 days ago
      Language is just part of the problem. Unicode lets you store text as entered, but what you do with that text completely depends on what your problem domain is. When you're writing software to validate that the name on someone's ID matches that on a ticket, you're probably going to normalise that name to your (customer's) locale rather than render each name in the locale it was originally written in. As long as you keep your locale settings consistent and don't do bad stuff like "iterate over characters and individually transform them", you're probably fine, unless your problem domain calls for something else.

      If you're printing a name, you're probably printing the name for the current user, not for the person who entered it at some point. If you're going to try to convert back like that, you also need to store a timestamp with every string in case a language changes its rules (such as permitting ẞ instead of SS when capitalising ß). And even then, someone might intend to use the new spelling rules, or they might not, who knows!

      This article probably boils down to "programmers don't realise graphemes aren't characters and characters aren't bytes even though they usually are in US English". The core problem, "text processing looks easy as long as you only look at your own language", is one that doesn't just affect computers.

      Your best bet is to just avoid the entire problem by not processing input further than basic input sanitisation, such as removing whitespace prefixes/suffixes and maybe stripping out invalid unicode so it can't be used as a weird stored attack.

    • thayne3 days ago
      Not quite.

      islower is actually supposed to account for the user's "locale", which includes their language.

      The key takeaway is that lowercasing a string needs to be done on the whole string, not individual characters, even if std::string had a way to iterate over codepoints instead of bytes (or code units, in the case of wstring).

      And there isn't a standard way to do that; you either need to use a platform-specific API, like the Windows function mentioned, or use a library like ICU.
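
      With ICU's C++ API the whole-string, locale-aware version is only a few lines (a sketch; helper name invented):

          #include <unicode/locid.h>
          #include <unicode/unistr.h>
          #include <string>

          // Locale-aware lowercasing of a UTF-8 string via ICU.
          std::string to_lower(const std::string& utf8, const char* locale) {
              icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
              s.toLower(icu::Locale(locale)); // e.g. "tr" maps I to dotless ı
              std::string out;
              return s.toUTF8String(out);
          }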

  • Animats3 days ago
    > From the article: "I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something."

    I had to do that. When we had our steampunk telegraph office at steampunk conventions [1], people could text in a message via SMS, it would be printed on a Model 14 or 15 Teletype, put in an envelope, and hand-delivered. People would use emoji in messages, and the device could only print Baudot, or International Telegraphic Alphabet #2, which is upper case only with some symbols.

    Emoji translation would cause the machine to hammer out

        (RED-HEART)
    
    or whatever emoji description was needed.

    Used the emoji list at [2], an older version.

    [1] https://vimeo.com/124065314

    [2] http://unicode.org/emoji/charts-beta/full-emoji-list.html

  • vardump4 days ago
    As always, Raymond is right. (And as usual, I could guess it was him before even clicking the link.)

    That said, 99% of the time when doing an upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.

    For the remaining 1%, there's the ICU library. Just like Raymond Chen mentioned.

    • crazygringo3 days ago
      > That said, 99% of the time when doing an upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.

      I think it's more the exact opposite.

      The only times I'm dealing with 7-bit ASCII is for internal identifiers like variable names or API endpoints. Which is a lot of the time, but I can't ever think of when I've needed my code to change their case. It might literally be never.

      On the other hand, needing to switch between upper, lower, and title case happens all the time, always with people's names and article titles and product names and whatnot. Which are never in ASCII because this isn't 1990.

        • bigstrat2003 3 days ago
        > Which are never in ASCII because this isn't 1990.

        This is a very silly statement. I'm willing to believe that you have lots of cases where those things are outside the ASCII range. Perhaps even most of the cases, depending on where you live. But I do not believe for one second that it never happens.

        • crazygringo3 days ago
          Never stored in ASCII, never limited to ASCII. They're UTF-8, usually.

          If somebody's name happens to fit into ASCII that's irrelevant because it's not guaranteed, so you can never blindly do an ASCII case conversion.

          For text data meant for users, I literally cannot remember the last time I used a string in ASCII format as opposed to UTF-8 (or UTF-16 in JS). It's certainly over a decade ago.

          So yes, when I say never, I literally mean never. Nothing "very silly" about it, sorry.

          (Again, excepting identifiers, where case conversion is not generally applicable.)

      • hinkley3 days ago
        And you could argue that if the internal identifiers need to be capitalized or lower-cased, you've already lost.

        On an enterprise app these little string manipulations are a drop in the bucket. In a game they might not be. Sort that stuff out at compile time, or commit time.

        • account423 days ago
          You can't always control the case you get, but often you can simply not care about anything outside ASCII. Scripts and configuration or text-based data formats are common examples.
    • sebstefan4 days ago
      Yes please, keep making software that mangles my actual last name at every step of the way. 99% of the world loves it when you only care about the USA.
      • Muromec4 days ago
        If it needs to uppercase names it probably interfaces with something forsaken like Sabre/Amadeus that only understands ASCII anyway.

        The real problem is accepting non-ASCII input from the user where you later assume it's ASCII-only and safe to bitfuck around.

        • sebstefan4 days ago
          From experience, anything banking-adjacent will usually fuck it up as well.

          For some reason they have a hard-on for putting last names in capital letters, and they still have systems in place that use ASCII.

          • Muromec4 days ago
            If it uses ASCII anyway, what's the problem then? Don't accept non-ASCII user input.
            • sebstefan4 days ago
              First off: And exclude 70% of the world?

              Usually they'll accept it, but some parts of the backend are still running code from the 60's.

              So you get your name rendered properly on the web interface and most core features, but one day you wander off the beaten path by, like, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters, so it usually drops the accents; not sure how it would work if your name was in an entirely different alphabet.

              • Muromec3 days ago
                >First off: And exclude 70% of the world?

                Guess what, I'm part of this 70%, and I also work in a bank, and I know exactly how that works.

                Not a single letter in my name (any of them) can be represented with ASCII. When it is represented in UTF-8, most of the people who have to see it can't read it anyway.

                So my identity document, issued by a country which doesn't use the Latin alphabet, includes an ASCII representation of my name in addition to the canonical form in Ukrainian Cyrillic. That ASCII rendering is happily accepted by all kinds of systems that only speak ASCII.

                People still can't pronounce it and it got misspelled like yesterday when dictated over the phone.

                Now, regarding the accents: it's illegal not to support them per GDPR (as per case law, discussed here a few years ago).

                • numpad03 days ago
                  Why can't these people understand that that 70% of the world considers ASCII to be "the computer language", not English, and UTF-8 to be "whatever soup that only works inside files and forms and can't be manipulated programmatically"?

                  Maybe it needs to be communicated more often, like way more often, until it sticks.

                  • Muromec2 days ago
                    Well, it's much easier to understand the difference when the one and the other use different alphabets.
              • account423 days ago
                You are not being excluded just because you need to use a romanized version of your name. Clear example of a first world problem.
                • sebstefan2 days ago
                  >first world problem

                  ? The more first-world you are, the more your alphabet is taken into consideration.

                  Hint: You use the word """romanized"""

        • InfamousRece4 days ago
          Some systems are still using EBCDIC.
      • account423 days ago
        Cool, I will.
      • MajimasEyepatch3 days ago
        It’s totally reasonable to assume your users are in the US if your business only sells to people in the US. I work in the health insurance sector; there’s absolutely no chance my company ever sells these products internationally. We can’t even sell them in every state.
        • davidcbc3 days ago
          It's not reasonable to assume that users in the US have names that only use 7-bit ASCII
          • account423 days ago
            It's reasonable to assume that all users can deal with having to encode their names in 7-bit ASCII. Otherwise you might as well demand that computer systems support arbitrary drawings in the name field, at which point you might as well not have a name field at all, because even most humans won't be able to deal with what you want to put in there.
            • davidcbc2 days ago
              Nice slippery slope you've got there
          • bigstrat20033 days ago
            It actually is. That covers the vast, vast majority of people in the US.
    • fhars4 days ago
      No, when you are doing string manipulation, you are almost never interested in just the seven-bit ASCII range, as there is almost no language that can be written using just that.
      • vardump4 days ago
        > as there is almost no language that can be written using just that.

        99% of use cases I've seen have nothing to do with human language.

        The 1% human-language case needs to be handled properly, using a proper Unicode library.

        Your mileage (percentages) may vary depending on your job.

        • kergonath4 days ago
          Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…

          In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.

          ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.

          • Factory3 days ago
            "The vast majority of the population does not read or write any English in their day to day lives." This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num... While English speakers are not a majority, it is the most popular language. And one should also note that given English is the lingua franca of programming, I'd suspect that English as a second language is actually a majority for programmers. So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
            • kergonatha day ago
              > "The vast majority of the population does not read or write any English in their day to day lives." This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num... While English speakers are not a majority, it is the most popular language.

              That is the number of English-speaking people, as in people who can speak English. Not necessarily people who use it every day. In any case, even English only fits in ASCII if you ignore all loan words and diacritics in things like proper names.

              > So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.

              That would not be much code at all, given that most code deals with user interfaces or user-provided data. That is the point: it’s not because the code is in basic English simplified enough to fit in ASCII that you can ignore Unicode and don’t need to consider text encoding.

          • vardump4 days ago
            I said use a Unicode library if input data is actual human language. Which names and addresses are.

            The 99% case is ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)

            • kergonath4 days ago
              I am really not sure about this 99%. A lot of programs deal with quite a lot of user-provided data, which you don’t control.
              • account423 days ago
                User-provided data, yes, but also data where you can treat non-ASCII bytes as garbage in -> garbage out. E.g. the config file might be typed by a human but if you need to support case-insensitive keys you still don't need to worry about Unicode.
                • kergonatha day ago
                  Exactly. But in this case, don’t try to upper-case or otherwise transform anything.
          • numpad03 days ago
            > That’s why I still get mail with my name mangled

            Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.

            > it’s better if a Chinese user name does not break your reporting or logging systems

            You should not be casually dumping Chinese usernames into logs without warnings; in fact, you should not be using Chinese characters for usernames at all. Lots of Chinese online services exclusively use numeric IDs and e-mails for login IDs. "Usernames in natural human language" is a valid concept only in the ASCII cultural sphere.

            • kergonatha day ago
              > Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.

              That is not always possible, and the translation from a local writing system to ASCII is often ambiguous and not unique. There really is no excuse for this sort of thinking. Even American programmers have to realise at some point that programs serve some purpose, and that their failure to represent how the world works is just that: a failure. There is no excuse for programs to not support UTF-8 from user input to any output, including all the processing in between.

          • Muromec4 days ago
            Who still tries to lowercase/uppercase names, and why? Please tell them to stop.
            • kergonath4 days ago
              Hell if I know. I don’t know what kind of abomination e-commerce websites run on their backend, I just see the consequences.
        • 9dev4 days ago
          It's funny how much software developers live in bubbles. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must handle text encoding properly, for example; that includes pretty much every line-of-business app out there, which is a lot of code.
        • elpocko4 days ago
          Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.
          • account423 days ago
            Search needs a whole lot more normalization than just case folding.
        • inexcf4 days ago
          Why do you need upper- or lowercase conversion in cases that have nothing to do with human language?
          • vardump4 days ago
            Here's an example. Hypothetically, say you want to build an HTML parser.

            You might encounter tags like <html>, <HTML>, <Html>, etc., but you want to perform a hash table lookup.

            So first you're going to normalize to either lower- or uppercase.
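
            For that, a locale-independent ASCII-only fold is what you want; a minimal sketch:

                // ASCII-only lowercasing: exactly right for HTML tag
                // names (only A-Z ever case-fold there) and immune to
                // locale surprises like the Turkish dotless i.
                #include <string>

                std::string ascii_lower(std::string s) {
                    for (char& c : s)
                        if (c >= 'A' && c <= 'Z')
                            c += 'a' - 'A';
                    return s;
                }
                // ascii_lower("HTML") == ascii_lower("Html") == "html",
                // so every casing hits the same hash table key.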

            • ARandumGuy3 days ago
              Converting string case is almost never something you want to do for text that's displayed to the end user, but there are many situations where you need to do it internally. Generally when the spec is case-insensitive, but you still need to verify or organize things using string comparison.
            • inexcf4 days ago
              Ah, I see, we disagree on what "human language" is. An abbreviation like HTML and its different capitalisations sound to me a lot like a feature of human language.
              • recursive3 days ago
                Is this a serious argument? Humans don't directly use HTML to communicate with each other. It's a document markup language rendered by user agents, developed against a specification.
                • tannhaeuser3 days ago
                  Markup languages and SGML in particular absolutely are designed for digital text communication by humans and to be written using plain text editors; it's kind of the entire point of avoiding binary data constructs.

                  And to GP, SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration introduced in the "Extended Naming Rules" revision of ISO 8879 (SGML std, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this date; in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalization.

                  • recursive3 days ago
                    HTML is a text-based medium. But that doesn't make it a human language. Some human languages are not text-based. And some text is not a human language.

                    ANSI C was designed to be written by humans using a plain text editor. That doesn't make it a human language.

            • Muromec4 days ago
              But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.
      • daemin4 days ago
        I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

        The other normal cases of string usage are file paths and user interface, and the needed operations can be done with simple string functions; even in UTF-8 encoding the characters you care about are in the ASCII range. With file paths, the manipulations you're most often doing are path-based, so you only care about the '/', '\', ':', and '.' ASCII characters. With user interface elements you're likely to be using them as just static data, only substituting values into placeholders when necessary.

        • pistoleer4 days ago
          > I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.

          Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure this capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII and even larger than Unicode.

          Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.
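
          To give an idea of what you end up hand-rolling, here is a sketch of just the sort-key part (the prefix list is simplified; real tussenvoegsel data has many more variants):

              // Sketch: build a Dutch-style sort key so "van den Berg"
              // files under B as "Berg, van den". Simplified; real
              // rules and prefix lists are much longer.
              #include <string>
              #include <string_view>

              std::string dutch_sort_key(std::string_view name) {
                  static const std::string_view prefixes[] = {
                      "van den ", "van der ", "van de ", "van ", "de "};
                  for (auto p : prefixes) {
                      if (name.substr(0, p.size()) == p) {
                          std::string key(name.substr(p.size()));
                          key += ", ";
                          key += p.substr(0, p.size() - 1);  // no space
                          return key;  // "Berg, van den"
                      }
                  }
                  return std::string(name);
              }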

          • daemin4 days ago
            I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.

            Further I do say that if you're creating text for presenting to the user then the most common operation would be replacement of some field in pre-defined text.

            In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.

            • pistoleer4 days ago
              I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.

              And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".

        • heisenzombie4 days ago
          File paths? I think filesystem paths are generally “bags of bytes” that the OS might interpret as UTF-16 (Windows) or UTF-8 (macOS, Linux).

          For example: https://en.m.wikipedia.org/wiki/Program_Files#Localization

          • vardump4 days ago
            File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't, for example, care about correct UTF-16 surrogate pairs at all; it'd happily accept invalid UTF-16 strings.

            So use standard string processing libraries on path names at your own peril.

            It's a good idea to consider file paths as a bag of bytes.

            • netsharc4 days ago
              IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there was a layer in the filesystem driver that converted everything to uppercase: e.g. if you ran the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.

              I think I once hex-edited the file allocation table to give a file a lowercase name (or maybe it was disk corruption); trying to delete that file didn't work because the system would try to delete "FOO" and couldn't find it, since the file was named "FOo".

            • Someone4 days ago
              > It's a good idea to consider file paths as a bag of bytes

              (Nitpick: sequence of bytes)

              Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)

              Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.

              That’s part of why Plan 9 made the choice “names may contain any printable character (that is, any character outside hexadecimal 00-1F and 80-9F)” (https://9fans.github.io/plan9port/man/man9/intro.html)

            • daemin4 days ago
              That's what I mean: you treat filesystem paths as bags of bytes separated by known ASCII characters, since the only path manipulations you generally need to do are appending a path, removing a path, or changing an extension, all of which only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as-is (with some exceptions using OS-specific API functions as needed).
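
              For example, changing an extension only ever looks at those separator bytes; a sketch (UTF-8 continuation bytes are all >= 0x80, so they can never be mistaken for '/' or '.'):

                  // Sketch: swap a path's extension while treating
                  // everything between ASCII separators as opaque bytes.
                  #include <string>

                  std::string replace_ext(std::string path,
                                          const std::string& ext) {
                      std::size_t sep = path.find_last_of("/\\");
                      std::size_t base =
                          (sep == std::string::npos) ? 0 : sep + 1;
                      std::size_t dot = path.find_last_of('.');
                      if (dot != std::string::npos && dot > base)
                          path.erase(dot);  // keeps dotfiles intact
                      return path + ext;    // ext includes the '.'
                  }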
            • numpad03 days ago
              Just using UTF-8 for usernames at all is problematic. That has been a major PSA item for Windows users in my language literally since the 90s, and still is. Microsoft switched home folder names from the Microsoft Account username to a shortened form of the user's email for that reason.
          • account423 days ago
            Yes and most importantly, that interpretation is for display purposes ONLY. If your file manager won't let me delete a file because the name includes invalid UTF-16/UTF-8 then it is simply broken.
        • BoringTimesGang4 days ago
          Now double all of that effort, so you can get it to work with Windows' UTF-16 wstrings.
          • account423 days ago
            Better to just convert WTF-16 (Windows filenames are not guaranteed to be valid UTF-16) to/from WTF-8 at the API boundary and then do the same processing internally on all platforms.
    • PaulDavisThe1st3 days ago
      He may be right, but approximately 75% of the problems he describes are Microsoft-ecosystem specific.

      In Unix-land we don't use wchar_t or UTF-16, and his article is a good demonstration of why not.

      • pjmlp3 days ago
        UNIX land is even worse at international language support.

        As in, there isn't even anything in POSIX at the level other operating systems provide for localisation.

        Yes, there is some locale stuff, but it isn't enough for everything, hence why every modern programming language ends up having this as part of its standard library.

        • PaulDavisThe1st3 days ago
          When the notion of what a "string" is differs so much from language to language, i18n is never going to be an effective part of POSIX.
        • account423 days ago
          Is there a platform where you can't use ICU?
    • account423 days ago
      > That said, 99% of the time when doing an upper- or lowercase operation, you're interested in just the 7-bit ASCII range of characters.

      And std::tolower/toupper is the wrong tool for that as well.

    • yas_hmaheshwari3 days ago
      Wow, I came here to write exactly that, and it's heartening to see that I am not crazy.

      Just reading the title, with microsoft.com in brackets, I knew two things: 1. It would be written by Raymond Chen. 2. The article was going to be awesome.

  • alkonaut3 days ago
    Handle text in two ways: either it's controlled by you and you can do simple, efficient, and naive processing, or it's not (it's translated resources, or user input) and you can't.

    For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lower-case string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!

    Example: if you present a message to the user from resources, so your code is translate("USER_DIALOG_QUESTION_ABOUT_FISH"), which you want to look up knowing it will be in sentence case and present as uppercase, what will you do? Here you likely can't, and shouldn't, do toUpper(translate(resourceKey)). Just use two resources if you want to correctly transform text. The toUpper function isn't made for this.

    Trying to use a complex i18n-ready toUpper/toLower only helps part of the way. It still might not understand whether two S's should be contracted (SS → ß) or whether something is a proper noun and must stay capitalized. So it adds complexity and still isn't correct. Just use two resources!

    • account423 days ago
      > For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lower-case string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!

      C++ std::tolower/toupper (which are really just C tolower/toupper) are the wrong tool for that too, though, because they depend on the process locale, which makes them a) horribly inefficient and b) prone to blow your program up in interesting ways on customer systems. Not quite as bad as the locale-dependent standard number-parsing functions that want . in some locales and , in others, but they still should never be used.
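
      And on top of the locale dependence there's the signed-char trap: passing a plain char that happens to be negative is undefined behavior, so even "correct" use needs a cast. A minimal sketch of the defensible ASCII-only usage:

          // std::tolower's argument must be representable as unsigned
          // char (or be EOF); any non-ASCII byte where char is signed
          // is otherwise undefined behavior.
          #include <cctype>
          #include <string>

          void lower_in_place(std::string& s) {
              for (char& c : s)
                  c = static_cast<char>(
                      std::tolower(static_cast<unsigned char>(c)));
          }
          // Still byte-at-a-time and process-locale dependent, so only
          // defensible when the data is known to be plain ASCII.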

  • PhilipRoman4 days ago
    Thought this was going to be about and-not-ing bytes with 0x20. Wrong for most inputs but sure as hell faster than anything else.
  • cyxxon4 days ago
    Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now actually have an uppercase version of that, so it should uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The double-S thing is still widely used, though.
    • mkayokay4 days ago
      Duden mentions this: "When writing in capital letters, SS traditionally stands for ß. In some typefaces, however, there is also a corresponding capital letter; its use is optional ‹§ 25 E3›."

      But isn't it also dependent on the available glyphs in the font used? So e.g. it needs to be ensured that U+1E9E exists?

      • NoInkling3 days ago
        According to Wikipedia:

        > "Since 2024 the capital ⟨ẞ⟩ is preferred over ⟨SS⟩."

        https://en.wikipedia.org/wiki/%C3%9F

        Check reference #5 and compare it to the older wording in reference #12.

      • Kwpolska3 days ago
        I don't think there exists any code that makes uppercasing decisions based on the selected font. Besides, if it doesn't exist in the current font, there's probably a fallback font.
    • Muromec4 days ago
      But what if you need to uppercase a historical record in a vital records registry from the 1950s that was OCRed last week? Now you need to be not just locale-aware; your locale should be versioned.
    • pjmlp4 days ago
      Lowercasing is even better, because a Swiss user would expect the two-character sequence "SS" to be converted into "ss" and not "ß".

      And thus we add country specific locale to the party.

      • account423 days ago
        Not just a Swiss user as there are many German words that use ss and not ß. And having an ss where there should be an ß will be a lot less disruptive as the inverse because people are used to ASCII limitations.
    • Rygian3 days ago
      Footnote #3 in the article (referenced in your quote) covers the different ways to uppercase ß in more detail.
  • ChrisMarshallNY3 days ago
    I generally just use the language-supported tolower/upper() (or similar) routines. I assume that they take things like UTF and alternative type systems into account.

    I'm not sure about other languages, but Swift has pretty intense String support[0], and can go quite a long ways.

    Someone actually wrote a whole book about just Swift Strings[1].

    [0] https://docs.swift.org/swift-book/documentation/the-swift-pr...

    [1] https://flight.school/books/strings/

  • himinlomax3 days ago
    > And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: à Paris ⇒ A PARIS.

    That's incorrect: using diacritics on capital letters is always the preferred form; it's just that dropping them is acceptable, as it was often done for technical reasons.

  • serbuvlad4 days ago
    The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.
    • criddell3 days ago
      Strings in the C++ standard library do suck (and C++ is my favorite language).

      As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:

      > And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.

      • recursive3 days ago
        UTF-16 has all the complexity of UTF-8 plus surrogate pairs.
        • zahlman3 days ago
          Surrogate pairs aren't more complex than UTF-8's scheme for determining the number of bytes used to represent a code point. (Arguably the logic is slightly simpler.) But the important point is that UTF-16 pretends to be a constant-length encoding while actually having the surrogate-pair loophole - that's because it's a hack on top of UCS-2 (which originally worked well enough for Microsoft to get married to; but then the BMP turned out not to be enough code points). UTF-8 is clearly designed from scratch to be a multi-byte encoding (and, while the standard now makes the corresponding sequences illegal, the scheme was designed to be able to support much higher code points - up to 2^42 if we extend the logic all the way; hypothetical 6-byte sequences starting with values FC or FD would neatly map up to 2^31).
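
          To make the comparison concrete, here are both decode steps as a sketch:

              // UTF-8: the lead byte alone encodes the sequence length.
              #include <cstdint>

              int utf8_len(uint8_t lead) {
                  if (lead < 0x80)         return 1;  // 0xxxxxxx
                  if ((lead >> 5) == 0x6)  return 2;  // 110xxxxx
                  if ((lead >> 4) == 0xE)  return 3;  // 1110xxxx
                  if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
                  return -1;  // continuation or invalid lead byte
              }

              // UTF-16: a high surrogate means "one more unit"; the
              // pair combines arithmetically into a code point > BMP.
              uint32_t from_surrogates(uint16_t hi, uint16_t lo) {
                  // expects hi in [D800, DBFF] and lo in [DC00, DFFF]
                  return 0x10000u + ((uint32_t(hi) - 0xD800u) << 10)
                                  + (uint32_t(lo) - 0xDC00u);
              }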
  • zzo38computer3 days ago
    First, you should consider if you even need case folding; for many uses it will be unnecessary, anyways.

    Furthermore, the proper way to do case folding will depend on such things as the character set, the language, the specific context of the text being converted (e.g. in some cases specific letters are required, such as abbreviations of the names of SI units), etc. And then, it is not necessarily only "uppercase" and "lowercase", anyways.

    There might even be different ways to do it within the same language, with possible disagreements about usage (e.g. the German Eszett did not have an official capital form until 2017, although apparently some type designers made one anyway (and it was in Unicode before then, despite that)).

    If the character set is Unicode, then there is no single correct way to do it, despite what the Unicode Conspiracy insists.

    Also, for some uses there will be a specific way it is required to be done (due to how a file format or a protocol or whatever works), so in such cases, if the character set is something other than ASCII, you cannot just assume that it will always work the same way.

    You also cannot necessarily depend on the locale for such a thing, since it might depend on the data, as well.

    These things are bad enough on their own, but Unicode just makes them worse. If a program requires a specific case folding, it may not work under the wrong version of Unicode, which can be a security issue and/or cause other problems.

    (Another problem, which applies even if you do not use case folding, is that some people think that all text is or should be Unicode and that one character set is suitable for everything. Actually, one character set cannot be suitable for everything, regardless of what character set it is. Even if it was (which it isn't), it wouldn't be Unicode.)

  • high_na_euv4 days ago
    In C++ basic things are hard.
    • johnnyjeans3 days ago
      nothing about working with locales, or text in general, is basic. we were decades into working with digital computers before we moved past switchboards and LEDs. don't take for granted just how high of a perch upon the shoulders of giants you have. that's exactly how the mistakes in the blog post get made.
      • high_na_euv3 days ago
        I've worked in various languages like C#, C and C++, and I know where I've been fighting which kinds of problems.

        People always had some fancy reasoning about why things that should just work are not, but then a few years pass and things are improved.

        C++ is getting closer and closer to languages like C# in terms of making it harder to shoot yourself in the foot, but there is still huge room for improvement.

    • onemoresoop4 days ago
      It's subjective but I find C++ extremely ugly.
  • flareback3 days ago
    He gave 4 examples of how it's done incorrectly, but zero actual examples of doing it correctly.
    • TheGeminon3 days ago
      > Okay, so those are the problems. What’s the solution?

      > If you need to perform a case mapping on a string, you can use LCMap­String­Ex with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
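
      For reference, the ICU route looks roughly like this (a sketch; real code needs error checks):

          // Sketch of the ICU call the article points to. The locale
          // argument is the whole point: "I" lowercases differently
          // under "tr" (Turkish) than under "en".
          #include <unicode/ustring.h>
          #include <vector>

          std::vector<UChar> to_lower(const UChar* src, int32_t n,
                                      const char* locale) {
              UErrorCode err = U_ZERO_ERROR;
              // A call with no buffer preflights the needed length.
              int32_t need =
                  u_strToLower(nullptr, 0, src, n, locale, &err);
              std::vector<UChar> dst(need);
              err = U_ZERO_ERROR;  // preflight sets a buffer error
              u_strToLower(dst.data(), need, src, n, locale, &err);
              return dst;
          }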

    • crote3 days ago
      The correct thing to do is to not do it at all. If text is 3rd-party supplied, treat it like an opaque byte sequence. Alternatively, pay a well-trained human to do it by hand.

      All other options are going to result in edge cases where you're not handling it properly. It's like trying to programmatically split a full name into a first name and a last name: language doesn't work like that.

    • commandlinefan3 days ago

          for (int i = 0; i < strlen(s); i++) {
              s[i] ^= 0x20;
          }
      • calibas3 days ago
        Thank you for this universal approach. I can now toggle capitalization on/off for any character, instead of just being limited to alphabetic ones!

        Jokes aside, I was kinda hoping for a good answer that doesn't rely on a Windows API or an external library, but I'm not sure there is one. It's a rather complex problem when you account for more than just ASCII and the English language.

        • TZubiri3 days ago
          Next up, check out our vector addition implementation of Hello+World. Spoiler alert, the result is Zalgo
      • vardump3 days ago
        Surely you meant:

          s[i] &= ~0x20;
        
        We're talking about converting to upper case after all! As an added benefit, every space character (0x20) is now a NUL byte!
  • 3 days ago
    undefined
  • 3 days ago
    undefined
  • HPsquared3 days ago
    I thought this was going to be about adding or subtracting 32. Old school.
    • klyrs3 days ago
      I do hope you mean bitwise "addition" and "subtraction" -- (c => c&0xdf) or (c => c|0x20)
      • HPsquared3 days ago
        Tbh I come at this as a plebeian Excel user
  • codr73 days ago
    C++, where every line of code is a book waiting to be written.
  • 3 days ago
    undefined
  • 3 days ago
    undefined
  • the_gorilla3 days ago
    Why are some functions addressable in C++ and others not? Seems like a pointless design oversight.
    • bialpio3 days ago
      Footnote in the article provides the following explanation: "The standard imposes this limitation because the implementation may need to add default function parameters, template default parameters, or overloads in order to accomplish the various requirements of the standard."
  • guerrilla3 days ago
    C is hard. It seems like C++ just made things way harder. I don't regret skipping it. Why not just go right to Java, C#, JS, Haskell, etc. and do what you need in C.
  • account423 days ago
    A popular but wrong way to do Unicode

    > wchar_t

  • PoignardAzur3 days ago
    So I'm going to be that guy and say it:

    Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.

    • lilyball3 days ago
      While certainly much better, you still need to be aware that doing case conversion absent any locale information will never be perfect. If you want proper locale-aware conversion you can use the icu crate (https://docs.rs/icu/latest/icu/).
      • account423 days ago
        Exactly, simple "unicode-aware" case conversions are a trap. You are always going to need much more.
  • ahartmetz4 days ago
    ...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...
    • cannam4 days ago
      QString::toUpper/toLower are not locale-aware (https://doc.qt.io/qt-6/qstring.html#toLower)

      Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.

      • ahartmetz4 days ago
        I don't see a problem with that. You can have it done locale-aware or not and "not" seems like a sane default. QString will uppercase 'ü' to 'Ü' just fine without locale-awareness whereas std::string doesn't handle non-ASCII according to the article. The cases where locale matters are probably very rare and the result will probably be reasonable anyway.
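
        For illustration, a minimal sketch of what that looks like:

            // QString's default case mapping is Unicode-aware with no
            // locale setup needed:
            #include <QString>

            QString shout(const QString& s) {
                return s.toUpper();  // "über" -> "ÜBER"
            }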
        • account423 days ago
          That attitude is how you end up with exploits because your case folding is different from some other system you interact with.
    • vardump4 days ago
      You just want a banana, but you also get the gorilla. And the jungle.
    • account423 days ago
      QString is how you ensure you cannot open/delete some files you WILL eventually encounter.
    • aetherspawn3 days ago
      I will admit I don’t love the Qt licensing model, but most things in Qt just work as they are supposed to, and on every platform too.