If it's text the game needs to show to the user, then every version of that text is a translated text. The programmer will never know whether some context or locale will need word-order changes or anything more complicated. Just trust the translation team.
If the text is coming from the user, then change the design until no 'conversion' is needed. There are major issues in just showing the user back what they entered! The font for editing and for displayed text could be different, not even mentioning RTL and other issues.
Once people learn about localization, questions like "why doesn't a programming language do this 'simple text operation'?" become a newcomer detector. :)
I think you are purposefully misinterpreting the question. They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.
What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?
Of course! String manipulation with user-entered attributes like display names or chat messages is one millimeter away from good old SQL 'Bobby; drop table students'. Never, ever do it if you can avoid it. Every time someone 'just concatenates' two strings, e.g. to add a 'symbol that represents an input button', they create a bug that is both annoying and wrong. Games should use substitution patterns guided by the translation team, because there is no ASCII culture among the roughly 15 locales typically supported by big publishers.
There are exceptions, like platform-provided services that filter banned words in chat. And even there you don't have to do 'things with ASCII characters'. Yeah, players will input unsupported symbols everywhere they can, and you need good replacement characters for those, and you have to fix support for popular emojis regularly. That is expected by communities now.
I'm confused now. The article specifically mentions issues with UTF-16 and UTF-32 unicode characters outside the basic multilingual plane (BMP).
The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.
If you are operating on wide strings, there is no suitable single solution, partly because wstring is a terrible type. It has different widths on different platforms, and no string encoding format uses a generalized wstring; they all have mandatory min/max code-unit widths. So a wstring tells you nothing about the semantic representation of the actual encoded string contents.
The C++ stdlib could include a fully unicode aware string type set, and surrounding library. But personally I think C++ isn't the kind of language to provide an opinionated stdlib module for such a complex task. And there's no way to implement such a module without being very opinionated about something.
Since you mention narrow strings in the context of wstring, just to make sure... you can't convert a UTF-8 std::string character by character, in-place (in case that's what you meant).
7-bit ASCII code points are fine, but outside that it's not guaranteed that one UTF-8 byte converts into exactly one UTF-8 byte when converting case.
In most type definitions you cannot convert UTF-8 via simple iteration, because the element type generally represents a code unit, not a character.
You can have a library where UTF8 characters are a native type and code points are a mostly-hidden internal element. But again, that's highly opinionated for C++.
And it's a huge footgun. There is no ascii type in C++. People will use the generalized tolower for UTF8 encoded in narrow strings and have issues.
You could say the generalized tolower should support all the different width/encoding combinations and sort it out. But that's still highly opinionated as far as performance is concerned.
Generalized string conversion is a very complex problem and you really cannot simplify it in a way that will satisfy most C++ users. Just use ICU or utf8cpp if you want to do string operations and don't care what's going on under the hood. But even then I can't recommend just 1 library, because no perfect 3rd party library exists. A perfect first party library definitely could not exist.
Then why does std::max() exist?
>People will use the generalized tolower for UTF8 encoded in narrow strings and have issues.
tolower() and toupper() work correctly on UTF-8 strings, because UTF-8 was specifically designed so that non-ASCII characters were represented by sequences of purely non-ASCII bytes.
>Generalized string conversion is a very complex
Hence why people who say C++ should have a tolower() that operates on strings are not asking for more complex Unicode support.
> there's no way to implement such a module without being very opinionated about something.
indeed! Boost.Nowide[1] is such an opinionated library.
[1] https://www.boost.org/doc/libs/master/libs/nowide/doc/html/i...
Could not agree more. Any time I touch C I want to scoop my brain out of my ear. So many simple, unbelievably common operations have fifty "best" ways to do them, when they should have one happy path that 99% of use cases require baked in. Nobody should ever have to seriously consider something as ridiculous as "is tolower addressable?".
What conceivable reason would there be to ever need to do that? If the server takes commands in upper case, then have them in upper case from the start. If the server takes commands in lower case, have them in lower case from the start. If the server specifies that you need to invert the case of its response to use in the next request, find a server developed by someone not crazy.
Should only ever be needed for text from the user, and in that case, as GP said, find a way to examine it as-is, don't "convert".
> Ease of use?
What ease of use? When has futzing around with case ever made anything easier?
> Console commands (i.e. from Quake to minecraft)?
Why would those necessitate changing case?
In games, you can possibly get away with this. Most other people need to worry about things like string collation (locale-aware sorting) for user-supplied text.
I'd assume SleepyMyroslav's advice doesn't apply to devs willing to spend weeks at a time handling all the complexity in full.
I am not in gamedev, but I frequently have to develop middleware that takes in user entered data and formats it in a way that will import into a 3rd party system without errors. And that sometimes means changing the case on strings.
In my experience as a developer, this is very very common requirement.
Luckily I am not forced to use a low level language for any of my work. In C# I can simply do this: "hello world".ToUpper();
Two decades ago some developer probably went "Yeah, obviously all names start with capital letters!", not realizing that there are in fact plenty of names which start with a lowercase letter. So they added an input validation test which checks for capitals, which meant everyone feeding that system had to format their data. A whole ecosystem grew around the format of the output of that system, and now you're suddenly rewriting the system and you run into weird and plain wrong capitalization requirements for no technical reason whatsoever.
Alternatively, the same but start with punch cards which predate ASCII and don't distinguish between uppercase and lowercase letters.
> In C# I can simply do this: "hello world".ToUpper()
... which does not work.
Take a look at the German word "straße" (street), for example. Until very recently the "ß" character did not have an uppercase variant, so a ToUpper would convert it to "STRASSE". This is a lossy operation, as the reverse isn't true: the lowercase variant of "KONGRESSSTRASSE" (congress street) is not "kongreßstraße" - it's supposed to be "Kongressstraße".
It can get even worse: the phrase "in Maßen" (in moderate amounts) naively has the uppercase variant "IN MASSEN" - but that means "in huge amounts"! In that case it is probably better to stick to "IN MASZEN".
And then there's Turkish, where the uppercase variant of the letter "i" is of course "İ" rather than "I" - note the dot.
So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough on basic ASCII for languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible for general-purpose text fields.
In practice that means your options are a) giving up on the concept of uppercase/lowercase conversion and just passing it as-is, or b) accepting that you are inevitably going to be silently corrupting your data.
Have you ever read the documentation? https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...
Sure, you can now do case conversion for a specific culture, but which one?
Yes, we can simply ToUpper(). We just can't round-trip with ToUpper().ToLower() - but that's useless anyway, because we still have the original string if we need it, and it's fine if we don't.
Hmm, still relevant: https://www.moserware.com/2008/02/does-your-code-pass-turkey...
This is completely irrelevant because culture-sensitive case conversion relies on ICU/NLS.
Game UI is the place I’d expect to most likely come across horrific abuses of localization precisely because game UI is such a cobbled together layer of hacks on hacks.
Your web browser is doing it right now as you are reading this comment.
https://github.com/baikety/uWebKit
https://zenfulcrum.com/browser/docs/Readme.html
https://github.com/roydejong/chromium-unity-server
There are a lot more, I just got bored at 3. And it's not just Unity. Several exist for Unreal as well.
Why? Specifically because 2D layout and text rendering suck so much in game engines. What's ~50MB matter when you're shipping several GB of game assets?
An acceptable solution is given at the end of the article:
> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. I do need more standard library functions that solve these ordinary real-world programming problems.
On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.
Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of what the "new tricks" should be, so more features are added on top of its already impressive and very long list of features and capabilities.
You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.
So, C++ is doing fine. It's not that they omitted Unicode during the design phase. Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.
Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....
But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)
The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.
Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.
Or perhaps I should say naïve.
>>> 'ß'.upper()
'SS'
>>> 'ß'.lower()
'ß'
>>> 'ß'.casefold()
'ss'
There are a lot of really complicated tasks for Unicode strings. String casing isn't really one of them. (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)
> (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)
Yes that's my point. Because in typical languages strings don't store language metadata, this is impossible to do correctly in general.
It looks like an old NSString method that's available in both Obj-C and Swift.
The casefold function is even older than that. https://developer.apple.com/documentation/foundation/nsstrin... Its documentation specifically includes a discussion of the Turkish İ/I issue.
Allcaps (and smallcaps) has always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and just whatever you could draw before that. Historically, language was not as standardized as it is today.
True capitalisation has always existed but even that didn’t seem to have required a capital ß - why now?
assert_eq!("ὀδυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
[Notice that this is in fact entirely impossible with the naive strategy, since Greek cares about the position of symbols.]

Some of the latter examples aren't cases where a programming language or library should just "do the right thing" but cases of ambiguity where you need locale information to decide what's appropriate, which isn't "just as wrong as the C++ version" - it's a whole other problem. It isn't wrong to capitalise A-acute as a capital A-acute, it's just not always appropriate depending on the locale.
assert_eq!("\u{1F41}δυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
or assert_eq!("\u{03BF}\u{0314}δυσσεύς", "ὈΔΥΣΣΕΎΣ".to_lowercase());
For display it doesn't matter, but most other applications really want some kind of normalization, which does much, much more - so having a convenient to_lowercase() doesn't buy you as much as you think and can be actively misleading.

That doesn't prevent adding a new function that converts an entire string to upper or lowercase in a Unicode-aware way.
What would be wrong with adding new correct functions to the standard library to make this easy? There are already namespaces in C++ so you don’t even have to worry about collisions.
That’s the problem I see. It’s fine if you have a history of stuff that’s not that great in hindsight. But what’s wrong with having a better standard library going forward?
It’s not like this is an esoteric thing.
Unicode and character encodings are pretty esoteric. So are fonts. The stuff is technically everywhere and fundamental, but there are many encodings, technical details, etc. And most programmers only care about one language, or else only use UTF-8 with the most basic chars (the ones that agree with ASCII). That isn't terrible. You only need what you actually need. Most programs don't strictly have to be built for multiple random languages, and there is kind of a standard methodology to learn before you can do that.
I politely disagree. None of the programming languages that started integrating Unicode were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.
C++ has a far larger target area than other programming languages. There are widely used libraries which compile correctly on PDP-11s, even though they are updated constantly.
You can't just say "I'll be just making everything Unicode aware, backwards compatibility be damned, eh".
But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.
...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you can with the ASCII table or any other simple encoding.
Germans have their ß to S (or capital ß depending on the year), Turkish has ı/I and i/İ pairs, and tons of other languages have other rules.
Especially this I/ı and İ/i pair breaks tons of applications in very unexpected ways. I don't remember how many bugs I've reported, or how many workarounds I have implemented in my systems.
Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system that you even need two UTF-16 code units (4 bytes in total) to encode some single characters. This is insane (as in complexity; I guess they have their reasons).
Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (and I don't think there's any font that actually supports this).
AFAICS(As far as I can search), Simplified(PRC) and Traditional(Taiwan) Chinese encoding are respectively called GB2312 and Big5, and they're both two byte encodings with good practical coverage. Same applies for Japanese Shift_JIS. If e.g. :flag_cc: were allowed to be used as start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much and actually improving language supports.
So, seeing this just moved the complexity of Unicode one notch up in my head, and I respect the guys who designed and made it work. It was not whining or complaining of any sort. :)
FYI, it's never S. If there is no better option then SS and ss are the proper capital and lowercase substitutions.
Until the mid-2000s there was no certainty that Unicode would eventually defeat its competitors. In reality it hasn't fully yet - GB2312 and TRON still prevail locally, and IBM still jogs along with EBCDIC. But in its early days nobody was reasonably sure, and the Java attempt could have failed as well. (More so since Java's UCS-2 approach was wrong - already commented nearby.)
You can actually end up in a cleaner state in C++, as there is no obligation to use the standard library string classes, but it's pretty much required in Java.
By not committing to UCS-2 early C++ left the road open to UTF-8. I'll concede that UTF8 has risen as the clear winner for more than a decade and C++ is well past the point that it should have at least basic builtin support. The problem is that there is at least one important C++ platform that only very recently added full support for the encoding in their native API.
Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.
Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.
Isn't that mostly just from tables derived from the Unicode standard?
That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.
Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.
How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input - just don't str-lower / str-upper it.
Regardless of what ISO language we are talking about.
Because the C++ standard library cares about binary size and backwards compatibility, both of which are incompatible with a full Unicode implementation. Putting this in the stdlib means everyone has to pay for it even when they don't need it.
Libraries are fine, not everything needs to be defined by the language itself.
Plainly no need if there is a separate, easily attachable library (and with a permissive license). What C++ had to do - provide character (char{8,16,32}_t) and string types - it has done.
[1] Ended up looking at https://github.com/JakubSzark/zig-string
https://doc.rust-lang.org/stable/std/primitive.str.html#meth...
> ‘Lowercase’ is defined according to the terms of the Unicode Derived Core Property Lowercase.
https://doc.rust-lang.org/stable/std/primitive.str.html#meth...
> ASCII letters ‘A’ to ‘Z’ are mapped to ‘a’ to ‘z’, but non-ASCII letters are unchanged.
Now, "perfectly" is very strong. For example, the Turkish i problem. That is not solved. But 99% of Unicode stuff is handled correctly by default.
Does that include context-dependent conversion rules like o'reilly -> O'Reilly?
Because then every change in Unicode would need to be standardized in C++ as well. Yup. Can't have Unicode due to committee friction.
Stroustrup, laugheth!
These are mostly unicode or linguistics problems.
But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
It should be marked [[deprecated]], yes. There is no good reason to use std::tolower/toupper anywhere - they can neither do unicode properly nor are they anywhere close to efficient for ASCII. And their behavior depends on the process-global locale.
If you are handling multilingual text the locale is mandatory metadata.
The lowercase of "DON'T FUSS ABOUT FUSSBALL" is "don't fuss about fußball". Unless you're in Switzerland.
It is probably time for an Esperanto advocate to show up and set us all straight.
Se fareblus oni, jam farintus oni ("if it could be done, it would already have been done"). (It definitely won't happen on an echo-change day like today, either. ;))
Contra my comrade's comment, Esperanto orthography is firmly European, and so retains European-style casing distinctions; every sound thus still has two letters -- or at least two codepoints.
(There aren't any eszettesque bigraphs, but that's not saying much.)
What are you talking about? In Esperanto, one letter equals one sound.
If you're printing a name, you're probably printing the name for the current user, not for the person who entered it at some point. If you're going to try to convert back like that, you also need to store a timestamp with every string in case a language changes its rules (such as permitting ẞ instead of SS when capitalising ß). And even then, someone might intend to use the new spelling rules, or they might not, who knows!
This article probably boils down to "programmers don't realise graphemes aren't characters and characters aren't bytes even though they usually are in US English". The core problem, "text processing looks easy as long as you only look at your own language", is one that doesn't just affect computers.
Your best bet is to just avoid the entire problem by not processing input further than basic input sanitisation, such as removing whitespace prefixes/suffixes and maybe stripping out invalid unicode so it can't be used as a weird stored attack.
islower is actually supposed to account for the user's "locale", which includes their language.
The key takeway is that lowercasing a string needs to be done on the whole string, not individual characters, even if std::string had a way to iterate over codepoints instead of bytes (or code units, in the case of wstring).
And there isn't a standard way to do that; you either need to use a platform-specific API, like the Windows function mentioned, or use a library like ICU.
I had to do that. When we had our steampunk telegraph office at steampunk conventions [1], people could text in a message via SMS, it would be printed on a Model 14 or 15 Teletype, put in an envelope, and hand-delivered. People would use emoji in messages, and the device could only print Baudot, or International Telegraphic Alphabet #2, which is upper case only with some symbols.
Emoji translation would cause the machine to hammer out
(RED-HEART)
or whatever emoji description was needed.

Used the emoji list at [2], an older version.
[1] https://vimeo.com/124065314
[2] http://unicode.org/emoji/charts-beta/full-emoji-list.html
That said, 99% time when doing upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.
For the remaining 1%, there's ICU library. Just like Raymond Chen mentioned.
I think it's more the exact opposite.
The only times I'm dealing with 7-bit ASCII is for internal identifiers like variable names or API endpoints. Which is a lot of the time, but I can't ever think of when I've needed my code to change their case. It might literally be never.
On the other hand, needing to switch between upper, lower, and title case happens all the time, always with people's names and article titles and product names and whatnot. Which are never in ASCII because this isn't 1990.
This is a very silly statement. I'm willing to believe that you have lots of cases where those things are outside the ASCII range. Perhaps even most of the cases, depending on where you live. But I do not believe for one second that it never happens.
If somebody's name happens to fit into ASCII that's irrelevant because it's not guaranteed, so you can never blindly do an ASCII case conversion.
For text data meant for users, I literally cannot remember the last time I used a string in ASCII format as opposed to UTF-8 (or UTF-16 in JS). It's certainly over a decade ago.
So yes, when I say never, I literally mean never. Nothing "very silly" about it, sorry.
(Again, excepting identifiers, where case conversion is not generally applicable.)
On an enterprise app these little string manipulations are a drop in the bucket. In a game they might not be. Sort that stuff out at compile time, or commit time.
The real problem is accepting non-ASCII input from user where you later assume it's ASCII-only and safe to bitfuck around.
For some reason they have a hard-on for putting last names in capital letters and they still have systems in place that use ASCII
Usually they'll accept it, but some parts of the backend are still running code from the 60's.
So you get your name rendered properly on the web interface and most core features, but one day you wander off the beaten path by, like, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters so it usually drops the accents; not sure how it would work if your name was in an entirely different alphabet.
Guess what, I'm part of this 70% and I also work in a bank and I know exactly how.
Not a single letter in my name (any of them) can be represented with ASCII. When it is represented in UTF-8, most of the people who have to see it can't read it anyway.
So my identity document issued by the country which doesn't use Latin alphabet includes ASCII-representation of my name in addition to canonical form in Ukrainian Cyrillic. That ASCII-rendering is happily accepted by all kinds of systems that only speak ASCII.
People still can't pronounce it and it got misspelled like yesterday when dictated over the phone.
Now regarding the accents, it's illegal not to support them per GDPR (as per case law, discussed here a few years ago).
Maybe it needs to be communicated more often, like way more often, until it sticks.
? The more first world you are the more your alphabet is taken into consideration
Hint: You use the word """romanized"""
99% of use cases I've seen have nothing to do with human language.
The 1% human-language case needs to be handled properly, using a proper Unicode library.
Your mileage (percentages) may vary depending on your job.
In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.
ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.
That is the number of English-speaking people, as in people who can speak English. Not necessarily people who use it every day. In any case, ASCII only works for a subset of even English if you ignore all loan words and diacritics in things like proper names.
> So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
That would not be much code at all, given that most code deals with user interfaces or user-provided data. That is the point: it’s not because the code is in basic English simplified enough to fit in ASCII that you can ignore Unicode and don’t need to consider text encoding.
99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)
Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
> it’s better if a Chinese user name does not break your reporting or logging systems
You should not be just casually dumping Chinese usernames into logs without warnings, in fact, you should not be using Chinese characters for usernames at all. Lots of Chinese online services exclusively use numeric IDs and e-mails for login IDs. "Usernames in natural human language" is a valid concept only in ASCII cultural sphere.
That is not always possible, and the translation from a local writing system to ASCII is often ambiguous rather than unique. There really is no excuse for this sort of thinking. Even American programmers have to realise at some point that programs serve some purpose and that their failure to represent how the world works is just that: a failure. There is no excuse for programs not to support UTF-8 from user input to any output, including all the processing in between.
You might encounter tags like <html>, <HTML>, <Html>, etc., but you want to perform a hash table lookup.
So first you're going to normalize to either lower- or uppercase.
And to GP, SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration, introduced in the "Extended Naming Rules" revision of ISO 8879 (the SGML standard, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this date, and in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalization.
ANSI C was designed to be written by humans using a plain text editor. That doesn't make it a human language.
The other normal cases of string usage are file paths and user interface text, and the needed operations can be done with simple string functions; even in UTF-8 encoding, the characters you care about are in the ASCII range. With file paths, the manipulations you're most often doing are path-based, so you only care about the '/', '\', ':', and '.' ASCII characters. With user interface elements, you're likely using them as static data and only substituting values into placeholders when necessary.
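This works because of a property of UTF-8: continuation bytes are always >= 0x80, so splitting on an ASCII delimiter can never land inside a multibyte sequence. A quick sketch in Rust (the path is made up):

```rust
fn main() {
    // Hypothetical path with non-ASCII components; only the ASCII
    // delimiters '/' and '.' matter for the manipulation.
    let path = "/home/日本語/报告.final.txt";

    // Splitting on '/' is safe even though the components are UTF-8:
    // the delimiter byte cannot occur inside a multibyte character.
    let file = path.rsplit('/').next().unwrap();
    assert_eq!(file, "报告.final.txt");

    // The extension is everything after the last '.'.
    let ext = file.rsplit('.').next().unwrap();
    assert_eq!(ext, "txt");
}
```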
Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure the capitalization is correct and not "Van Den Berg". There is no built-in standard library function in any language that does any of that. It's so much larger than ASCII, and even larger than Unicode.
Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.
Further, I do say that if you're creating text for presentation to the user, then the most common operation would be replacement of some field in pre-defined text.
In your case I would design it so that the correctly capitalised first name, surname, and variations of those for sorting would be generated at the data entry point (manually or automatically) and then just used when needed in user facing text generation. Therefore the only string operation needed would be replacement of placeholders like the fmt and standard library provide. This uses more memory and storage but these are cheaper now.
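A minimal sketch of that design (the field names and template are made up): the correctly cased variants are stored once at data entry, and display code only substitutes placeholders.

```rust
use std::collections::HashMap;

fn main() {
    // Pre-computed variants stored at the data entry point,
    // not derived by string transforms at display time.
    let mut fields = HashMap::new();
    fields.insert("surname_display", "van den Berg");
    fields.insert("surname_sort", "Berg, van den");

    // The only string operation needed later is placeholder replacement.
    let template = "Customer: {surname_display}";
    let rendered = template.replace("{surname_display}", fields["surname_display"]);
    assert_eq!(rendered, "Customer: van den Berg");
}
```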
And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".
For example: https://en.m.wikipedia.org/wiki/Program_Files#Localization
So use standard string processing libraries on path names at your own peril.
It's a good idea to consider file paths as a bag of bytes.
I think I once hex-edited the FAT (file allocation table) to change a filename to have a lowercase letter (or maybe it was disk corruption). Trying to delete that file didn't work, because the delete would look for "FOO" and couldn't find it: the file was actually named "FOo".
(Nitpick: sequence of bytes)
Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)
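That limitation is easy to demonstrate. A short Rust sketch (the filename is made up): a "bag of bytes" name that isn't valid UTF-8 can only be shown to the user lossily.

```rust
fn main() {
    // A "bag of bytes" filename that isn't valid UTF-8 (0xFF never
    // appears in well-formed UTF-8). Unix happily stores such names,
    // but showing one to the user forces a lossy conversion.
    let raw: &[u8] = b"report-\xff.txt";
    let shown = String::from_utf8_lossy(raw);
    // The invalid byte is replaced with U+FFFD REPLACEMENT CHARACTER,
    // so the displayed string no longer round-trips to the real name.
    assert_eq!(shown, "report-\u{fffd}.txt");
}
```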
Unix, from the start, claimed file names were byte sequences, yet assumed many of those to encode ASCII.
That’s part of why Plan 9 made the choice “names may contain any printable character (that is, any character outside hexadecimal 00-1F and 80-9F)” (https://9fans.github.io/plan9port/man/man9/intro.html)
In Unix-land we don't use wchar_t or UTF-16, and his article is a good demonstration of why not.
As in, POSIX doesn't even offer localisation support at the level other operating systems do.
Yes, there is some locale machinery, but it isn't enough for everything, hence why every modern programming language ships this as part of its standard library.
And std::tolower/toupper is the wrong tool for that as well.
Just reading the title, with microsoft.com in brackets, I knew two things: 1. It would be written by Raymond Chen. 2. The article was going to be awesome.
For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!
For example, if you present a message to the user from resources, so your code is translate("USER_DIALOG_QUESTION_ABOUT_FISH"), which you look up knowing it will be in sentence case but want to present as uppercase, what do you do? Here you likely can't, and shouldn't, do toUpper(translate(resourceKey)). Just use two resources if you want to correctly transform text. The toUpper function isn't made for this.
Trying to use a complex i18n-ready toUpper/toLower only helps part of the way. It still might not understand whether a double S should be contracted to ß, or whether something is a proper noun and must stay capitalized. So it adds complexity and still isn't correct. Just use two resources!
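The two-resource idea can be sketched like this (the resource keys and German strings are made up for illustration): both casings come from the translation team, and the code only looks them up, because a blind case transform produces a different, possibly unwanted result.

```rust
use std::collections::HashMap;

fn main() {
    // Both casing variants are authored by translators; the program
    // never case-transforms translated text itself.
    let mut resources = HashMap::new();
    resources.insert("SIZE_LABEL", "Größe");
    resources.insert("SIZE_LABEL_UPPER", "GRÖẞE"); // translator chose capital ẞ

    // A blind to_uppercase() would have produced "GRÖSSE" instead,
    // which is not what the translators wanted here.
    assert_eq!(resources["SIZE_LABEL"].to_uppercase(), "GRÖSSE");
    assert_ne!(
        resources["SIZE_LABEL"].to_uppercase(),
        resources["SIZE_LABEL_UPPER"]
    );
}
```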
C++ std::tolower/toupper (which are really just C tolower/toupper) are the wrong tool for that too, though, because they depend on the process locale, which makes them a) horribly inefficient and b) prone to blow your program up in interesting ways on customer systems. Not quite as bad as the locale-dependent standard number parsing functions that want '.' in some locales and ',' in others, but they still should never be used.
But isn't it also dependent on the available glyphs in the font used? So, e.g., it needs to be ensured that U+1E9E exists?
> "Since 2024 the capital ⟨ẞ⟩ is preferred over ⟨SS⟩."
https://en.wikipedia.org/wiki/%C3%9F
Check reference #5 and compare it to the older wording in reference #12.
And thus we add country specific locale to the party.
I'm not sure about other languages, but Swift has pretty intense String support[0], and can go quite a long ways.
Someone actually wrote a whole book about just Swift Strings[1].
[0] https://docs.swift.org/swift-book/documentation/the-swift-pr...
That's incorrect, using diacritics on capital letters is always the preferred form, it's just that dropping them is acceptable as it was often done for technical reasons.
As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:
> And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
Furthermore, the proper way to do case folding will depend on such things as the character set, the language, the specific context of the text being converted (e.g. in some cases specific letters are required, such as abbreviations of the names of SI units), etc. And then, it is not necessarily only "uppercase" and "lowercase", anyways.
There might even be different ways to do by the same language, with possibly disagreements about usage (e.g. the German Eszett did not have an official capital form until 2017, although apparently some type designers did it anyways (and it was in Unicode before then, despite that)).
If the character set is Unicode, then there is not actually one correct way to do it, despite what the Unicode Conspiracy insists otherwise.
Also, for some uses there will be a specific way it is required to be done (due to the way a file format or a protocol or whatever works), so in such a case, if the character set is something other than ASCII, you cannot just assume it will always work the same way.
You also cannot necessarily depend on the locale for such a thing, since it might depend on the data, as well.
These things can be as bad as they are, but Unicode just makes them worse. If a program requires a specific case folding, it may not work under the wrong version of Unicode, which can be a security issue and/or cause other problems.
(Another problem, which applies even if you do not use case folding, is that some people think that all text is or should be Unicode and that one character set is suitable for everything. Actually, one character set cannot be suitable for everything, regardless of what character set it is. Even if it was (which it isn't), it wouldn't be Unicode.)
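The Eszett case mentioned above is easy to see in practice. Rust's standard library, for instance, applies the default Unicode mapping, where uppercasing ß grows the string, and nothing maps back to it:

```rust
fn main() {
    // Unicode's default SpecialCasing maps U+00DF (ß) to "SS",
    // so the string gets longer when uppercased ...
    assert_eq!("straße".to_uppercase(), "STRASSE");
    // ... and the round trip is lossy: there is no way back to ß.
    assert_eq!("STRASSE".to_lowercase(), "strasse");
    // The capital ẞ (U+1E9E) lowercases to ß, but no character
    // uppercases to ẞ under the default mapping.
    assert_eq!("ẞ".to_lowercase(), "ß");
}
```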
People always had some fancy reasoning about why things that should just work are not, but then a few years pass and things are improved.
C++ is getting closer and closer to languages like C# in terms of making it harder to shoot yourself in the foot, but there is still huge room for improvement.
> If you need to perform a case mapping on a string, you can use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
All other options are going to result in edge cases where you're not handling it properly. It's like trying to programmatically split a full name into a first name and a last name: language doesn't work like that.
/* "Case conversion": XOR 0x20 toggles the case bit of ASCII letters --
   and silently mangles digits, punctuation, and any non-ASCII bytes. */
for (size_t i = 0, n = strlen(s); i < n; i++) {
    s[i] ^= 0x20;
}
Jokes aside, I was kinda hoping for a good answer that doesn't rely on a Windows API or an external library, but I'm not sure there is one. It's a rather complex problem when you account for more than just ASCII and the English language.
Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.
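Worth noting that `String::to_lowercase` applies the locale-independent default Unicode mapping, so Rust spares you the locale pitfalls but the Turkish dotted/dotless I distinction is still on you:

```rust
fn main() {
    // The default mapping is fine for most text ...
    assert_eq!("TITLE".to_lowercase(), "title");
    // ... but Turkish 'İ' (U+0130) lowercases to 'i' followed by
    // U+0307 COMBINING DOT ABOVE under the default rules -- two
    // code points, not the plain 'i' a Turkish user would expect.
    assert_eq!("İ".to_lowercase().chars().count(), 2);
    // And ASCII 'I' never becomes dotless 'ı'; locale-aware Turkish
    // casing needs something like ICU on top.
    assert_eq!("I".to_lowercase(), "i");
}
```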
Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.