If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.
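For contrast, here is a minimal sketch of what the random-seek check looks like in UTF-8, where every continuation byte has the form `10xxxxxx`. The function name `utf8_sync_backward` is made up for illustration, and the sketch assumes the buffer holds valid UTF-8 and that `pos` is inside it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given an arbitrary byte offset into valid UTF-8,
 * step back to the first byte of the code point that contains it.
 * Continuation bytes are exactly the bytes matching 10xxxxxx, and a lead
 * byte is followed by at most three of them, so the loop runs at most
 * three times on valid input. */
static size_t utf8_sync_backward(const uint8_t *s, size_t pos) {
    while (pos > 0 && (s[pos] & 0xC0) == 0x80) {
        pos--;
    }
    return pos;
}
```

With an EBML-style layout the trailing bytes carry eight payload bits each, so no such two-bit test can tell them apart from a leading byte.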
Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.
I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.
https://peps.python.org/pep-0393/
I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use cases involve chopping off a small prefix (e.g. "hex_digits[2:]") or suffix (e.g. "filename[-3:]"), and you can easily just do a linear scan for these with minimal CPU penalty. Or they're part of library methods where you want your own custom traversals, e.g. .find(substr) can just do Boyer-Moore over bytes, and .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.
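A minimal sketch of that linear scan for small prefixes, assuming valid UTF-8 input. `utf8_offset_of` is a hypothetical name, not any library's API; it returns the byte offset at which the n-th code point starts.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of the n-th code point (0-based) in a
 * valid UTF-8 buffer. Returns len if the string has n or fewer code points.
 * Cost is O(n) in bytes scanned, which is cheap for small n. */
static size_t utf8_offset_of(const uint8_t *s, size_t len, size_t n) {
    size_t i = 0;
    while (i < len && n > 0) {
        i++;                                        /* step over the lead byte */
        while (i < len && (s[i] & 0xC0) == 0x80) {  /* and its continuations   */
            i++;
        }
        n--;
    }
    return i;
}
```

Something like `hex_digits[2:]` then becomes a slice starting at `utf8_offset_of(s, len, 2)`, and no per-string index table is needed.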
I agree though that usually you only need iteration, but string APIs need to change to return some kind of token that encapsulates both logical and physical index. And you probably want to be able to compute with those - subtract to get length and so on.
There are a variety of reasons why unsafe byte indexing is needed anyway (zero-copy?), it just shouldn’t be the default tool that application programmers reach for.
UTF-8 is used for C-level interactions. If that were the only representation in use, there would be no need to know the highest code point.
For Python semantics it uses one of ASCII, iso-8859-1, ucs2, or ucs4.
https://github.com/python/cpython/blob/main/Objects/unicodeo...
Also implies that Animats is correct that including an emoji in a Python string can bloat the memory consumption by a factor of 4.
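To make the factor of 4 concrete, here is a rough sketch of the width selection described in PEP 393: the largest code point in the string decides how many bytes every character occupies. The function names are invented for illustration and this is not CPython's actual code; it also ignores the per-object header.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: bytes per character for a string whose largest code
 * point is max_cp, following the 1/2/4-byte tiers of PEP 393. */
static unsigned flexible_char_width(uint32_t max_cp) {
    if (max_cp < 0x100) return 1;    /* ASCII and Latin-1 share the 1-byte form */
    if (max_cp < 0x10000) return 2;  /* UCS-2 */
    return 4;                        /* UCS-4 */
}

/* Character payload size, excluding any object header. */
static size_t flexible_payload_bytes(size_t n_chars, uint32_t max_cp) {
    return n_chars * flexible_char_width(max_cp);
}
```

A 1000-character ASCII string has a 1000-byte payload under this model; appending one U+1F4A9 pushes every character to 4 bytes.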
In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.
Combining characters still exist.
Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.
That's the other benefit of being able to resume UTF-8 strings midway: even concatenating broken strings still leaves all the intact characters present.
Substring operations are more dicey; those should be operating with known strings. In pathological cases they might operate against portions of Unicode bits... but that's as silly as using raw pointers and directly mangling the bytes without any protection or design plans.
What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.
> ascii and codepage encodings are legacy, let's standardize on another forwards-incompatible standard that will be obsolete in five years > oh, and we also need to upgrade all our infrastructure for this obsolete-by-design standard because we're now keeping it forever
UCS-2 was an encoding mistake, but even it was pretty forward compatible
Yes, it's a silly idea but it's exactly the reason why Python, Javascript and Java use the most braindead way of storing text known to man. (UCS-2)
Well... it explicitly wasn't supposed to fit all past characters when they decided on 16 bits.
And they weren't sure on size for a while, and only kept it for a couple years, so I would make the fact that you're complaining about the 16 bits more explicit.
But also it did turn out to be forward compatible. That's part of why we're stuck with it!
The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.
0xxxxxxx - ASCII
1xxxxxxx 0xxxxxxx - U+0080 .. U+3FFF
1xxxxxxx 1xxxxxxx 0xxxxxxx - U+4000 .. U+10FFFD
0xxxxxxx - ASCII
0xxxxxxx 1xxxxxxx - U+0080 .. U+3FFF
0xxxxxxx 1xxxxxxx 1xxxxxxx - U+4000 .. U+10FFFD
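As a rough sketch of the first layout above (continuation bit set on every byte except the last, most significant group first), an encoder could look like this. `vlq_encode` is a hypothetical name, and the 5-byte output buffer is an assumption that covers any 32-bit input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for the VLQ-style layout: 7 payload bits per byte,
 * most significant group first, MSB = 1 on all bytes except the last.
 * Returns the number of bytes written (1..5). */
static size_t vlq_encode(uint32_t cp, uint8_t out[5]) {
    uint8_t groups[5];
    size_t n = 0;
    do {                        /* split into 7-bit groups, low group first */
        groups[n++] = (uint8_t)(cp & 0x7F);
        cp >>= 7;
    } while (cp != 0);
    for (size_t i = 0; i < n; i++) {
        uint8_t b = groups[n - 1 - i];  /* emit the high group first */
        if (i + 1 < n) {
            b |= 0x80;                  /* more bytes follow */
        }
        out[i] = b;
    }
    return n;
}
```

Under this scheme U+3042 (a Hiragana letter) takes two bytes and U+1F4A9 takes three, versus three and four in UTF-8.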
It's not self-synchronizing like UTF-8, but it's more compact - any Unicode codepoint can fit into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. Can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 only has overhead of 1/8 for ASCII and 1/3 thereafter. Could be useful compressing the size of code that uses non-ASCII, since most of the mathematical symbols/arrows are < U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also < U+3FFF, and could be encoded in 2 bytes rather than 3.
| Header | Total Bytes | Payload Bits |
| ---------- | ----------- | ------------ |
| `.......1` | 1 | 7 |
| `......10` | 2 | 14 |
| `.....100` | 3 | 21 |
| `....1000` | 4 | 28 |
| `...10000` | 5 | 35 |
| `..100000` | 6 | 42 |
| `.1000000` | 7 | 49 |
| `10000000` | 8 | 56 |
| `00000000` | 9 | 64 |
The full value is stored little endian, so you simply read the first byte (low byte) in the stream to get the full length, and it has the exact same compactness of VLQ/LEB128 (7 bits per byte). Even better: modern chips have instructions that decode this field in one shot (callable via builtin):
https://github.com/kstenerud/ksbonjson/blob/main/library/src...
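#include <stddef.h>
#include <stdint.h>
/* Header 0x00 (the 9-byte form) must be handled before calling: __builtin_ctz(0) is undefined. */
static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
    return (size_t)__builtin_ctz(header) + 1;
}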
After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).
https://github.com/kstenerud/ksbonjson/blob/main/library/src...
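The steps described above can be sketched roughly as follows, for payloads of up to 56 bits. This is my own illustration, not the code from the linked repository: `decode_prefixed_u56` is a hypothetical name, the bounds checks are an added assumption, and the bytes are assembled one at a time so the sketch does not depend on host endianness. `__builtin_ctz` is the GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder for the length-prefixed layout in the table above.
 * Returns the number of bytes consumed, or 0 if the input is empty,
 * truncated, or uses the 0x00 header (the 9-byte form, not handled here). */
static size_t decode_prefixed_u56(const uint8_t *buf, size_t len, uint64_t *out) {
    if (len == 0 || buf[0] == 0) {
        return 0;
    }
    /* Position of the lowest set bit in the header gives the total size. */
    size_t total = (size_t)__builtin_ctz(buf[0]) + 1;
    if (total > len) {
        return 0;
    }
    /* Re-read all `total` bytes, header included, as one little-endian integer. */
    uint64_t raw = 0;
    for (size_t i = 0; i < total; i++) {
        raw |= (uint64_t)buf[i] << (8 * i);
    }
    /* The length marker occupies the low `total` bits; shift it away. */
    *out = raw >> total;
    return total;
}
```

For example, the value 300 needs the two-byte form: `(300 << 2) | 0b10` is 0x04B2, stored as the bytes B2 04.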
It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.
https://github.com/kstenerud/bonjson
If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.
A rough implementation is not hard. (For writing, my implementation writes a BOM at the beginning and only handles 28 bits.)
https://github.com/roytam1/rtoss/commit/b09bd53d7f4166f34c8b...
We'd use `vpmovb2m`[1] on a ZMM register (64-bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.
Then process the mask register 1 byte at a time, using it as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, and finish with a conditional branch to the next entry in the jump table or to the next 64 bytes. Any trailing ones in each mask byte would simply be added to a carry, which would be consumed up to the most significant zero in the next eight bytes.
[1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
While you might be able to have some heuristic to determine whether a character is a valid match, it may give false positives and it's unlikely to be as efficient as "test if the previous byte's MSB is zero". We can implement parallel search with VLQs because we can trivially synchronize the stream to next nearest character in either direction - it's partially-synchronizing.
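A small sketch of that "previous byte's MSB is zero" test, assuming the VLQ-style layout where every byte except the last one in a character has its MSB set. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A character starts at pos if it is the first byte of the buffer or if the
 * previous byte ended a character (MSB clear). */
static bool vlq_is_char_start(const uint8_t *s, size_t pos) {
    return pos == 0 || (s[pos - 1] & 0x80) == 0;
}

/* Advance pos to the next character start; returns len if there is none. */
static size_t vlq_sync_forward(const uint8_t *s, size_t len, size_t pos) {
    while (pos < len && !vlq_is_char_start(s, pos)) {
        pos++;
    }
    return pos;
}
```

Unlike UTF-8, the byte at `pos` alone is not enough: the check needs one byte of context, which is why this is only partially synchronizing.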
Obviously not as good as UTF-8 or UTF-16 which are self-synchronizing, but it can be implemented efficiently and cut encoding size.
Quick googling (not all of them are on-topic tho):
https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...
You are correct that it never occurs at the start of a byte that isn’t a continuation byte: the first byte in each encoded code point starts with either 0 (ASCII code points) or 11 (non-ASCII).
https://en.wikipedia.org/wiki/Unary_numeral_system
and also use whatever bits are left over encoding the length (which could be in 8 bit blocks so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic
https://archive.org/details/managinggigabyte0000witt
together with other methods that let you compress a text + a full text index for the text into less room than text and not even have to use a stopword list. As you say, UTF-8 does something similar in spirit but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.
You mean codepoints or maybe grapheme clusters?
Anyways yeah it’s a little more complicated but the principle of being able to truncate a string without splitting a codepoint in O(1) is still useful
> truncate a string without splitting a codepoint in O(1) is still useful
Agreed!
I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you naively slice an array of UTF-8 bytes, a UTF-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.
Or you accept that if you're randomly losing chunks, you might lose an extra 3 bytes.
The real problem is that seeking a few bytes won't work with EBML. If continuation bytes store 8 payload bits, you can get into a situation where every single byte could be interpreted as a multi-byte start character and there are 2 or 3 possible messages that never converge.
But it doesn't matter if it takes 1 byte or 3 bytes to synchronize. And being unable to read backwards is not a problem.
(EBML doesn't synchronize in three bytes but other encodings do.)
Given four byte maximum, it's a similarly trivial algo for the other case you mention.
The main difference I see is that UTF-8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention, the continuation bytes would cause silent errors (since an ASCII character would be indistinguishable from a continuation byte).
Encoding gurus-- am I right?
It is not true [1]. While it is not a UTF-8 problem per se, it is a problem with how UTF-8 is being used.
[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...
what you describe is the bare minimum so you even know what you are searching for while you scan pretty much everything every time.
UTF-8 didn't win on technical merits, it won because it was mostly backwards compatible with all American software that previously used ASCII only.
When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).
UTF-8 and UTF-16 take the same number of bytes to encode a non-BMP character or a character in the range U+0080-U+07FF (which includes most of the Latin supplements, Greek, Cyrillic, Arabic, Hebrew, Aramaic, Syriac, and Thaana). For ASCII characters--which includes most whitespace and punctuation--UTF-8 takes half as much space as UTF-16, while for characters in the range U+0800-U+FFFF, UTF-8 takes 50% more space than UTF-16. Thus, for most European languages, and even Arabic (which ain't European), UTF-8 is going to be more compact than UTF-16.
The Asian languages (CJK-based languages, Indic languages, and South-East Asian, largely) are the ones that are more compact in UTF-16 than UTF-8, but if you embed those languages in a context likely to have significant ASCII content--such as an HTML file--well, it turns out the UTF-8 still wins out!
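The ranges above boil down to two small functions. This is a sketch with made-up names, and it assumes the input is a valid Unicode scalar value (not a surrogate, not above U+10FFFF).

```c
#include <stdint.h>

/* Encoded size in bytes of one code point in UTF-8. */
static int utf8_len(uint32_t cp) {
    if (cp < 0x80) return 1;      /* ASCII */
    if (cp < 0x800) return 2;     /* U+0080..U+07FF */
    if (cp < 0x10000) return 3;   /* U+0800..U+FFFF */
    return 4;                     /* non-BMP */
}

/* Encoded size in bytes of one code point in UTF-16. */
static int utf16_len(uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;  /* one code unit, or a surrogate pair */
}
```

Summing these over a document's code points is an easy way to check the "UTF-8 still wins in HTML" claim on real files.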
> When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).
You'll notice that the encodings that are used are not UTF-16 either. Also, my understanding is that China generally defaults to UTF-8 nowadays despite a government mandate to use GB18030 instead, so it's largely Japan that is the last redoubt of the anti-Unicode club.
UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.
Spanish has generally at most one accented vowel (á, ó, ü, é, ...) per word, and generally at most one ñ per word. German rarely has more than two umlauts per word, and almost never more than one ß.
UTF-16 is a wild pessimization for European languages, and UTF-8 is only slightly wasteful in Asian languages.
And unlike the short-sighted authors of the first version of Unicode, who thought the whole world's writing systems could fit in just 65,536 distinct values, the authors of UTF-8 made it possible to encode up to 2 billion distinct values in the original design.
Now all of this hating on UTF-16 should not be misconstrued as some sort of encoding religious war. UTF-16 has a valid purpose. The real problem was Unicode's first version getting released at a critical time and thus its 16-bit delusion ending up baked into a bunch of important software. UTF-16 is a pragmatic compromise to adapt that software so it can continue to work with a larger code space than it originally could handle. Short of rewriting history, it will stay with us forever. However, that doesn't mean it needs to be transmitted over the wire or saved on disk any more often than necessary.
Use UTF-8 for most purposes especially new formats, use UTF-16 only when existing software requires it, and use UTF-32 (or some other sequence of full code points) only internally/ephemerally to convert between the other two and perform high-level string functions like grapheme cluster segmentation.
A true flaw of UTF-8 in the long run. They should have biased the values of multibyte sequences to remove redundant encodings.
EDIT: Heh. The U+1F4A9 emoji that I included in my comment was stripped out. For those who don't recognize that codepoint by hand (can't "see" the Matrix just from its code yet?), that emoji's official name is U+1F4A9 PILE OF POO.
It is 33% more compact for most (but not all) CJK characters, but that's not the case for all non-English characters. However, one important thing to remember is that most computer-based documents contain large amounts of ASCII text purely because the formats themselves use English text and ASCII punctuation. I suspect that most UTF-8 files with CJK contents are much smaller than UTF-16 files, but I'd be interested in an actual analysis from different file formats.
The size argument (along with a lot of understandable contention around UniHan) is one of the reasons why UTF-8 adoption was slower in Japan and Shift-JIS is not completely dead (though mainly for esoteric historical reasons like the 漢検 test rather than active or intentional usage) but this is quite old history at this point. UTF-8 now makes up 99% of web pages.
You could argue that because it will be compressed (and UTF-16 wastes a whole NUL byte for all ASCII) that the total file-size for the compressed version would be better (precisely because there are so many wasted bytes) but there are plenty of examples where files aren't compressed and most systems don't have compressed memory so you will pay the cost somewhere.
But in the interest of transparency, a very crude test of the same ePUB yields a 10% smaller file with UTF-16. I think a 10% size penalty (in a very favourable scenario for UTF-16) in exchange for all of the benefits of UTF-8 is more than an acceptable tradeoff, and the incredibly wide proliferation of UTF-8 implies most people seem to agree.
Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.
1. Invalid bytes. Some bytes cannot appear in a UTF-8 string at all. There are two ranges of these.
2. Conditionally invalid continuation bytes. In some states you read a continuation byte and extract the data, but in some other cases the valid range of the first continuation byte is further restricted.
3. Surrogates. They cannot appear in a valid UTF-8 string, so if they do, this is an error and you need to mark it so. Or maybe process them as in CESU, but this means making sure they are correctly paired. Or maybe process them as in WTF-8: read and let go.
4. Form issues: an incomplete sequence or a continuation byte without a starting byte.
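The four cases above fit in a fairly short strict validator. This is a sketch of one way to do it, with a made-up function name; it rejects surrogates outright rather than handling them as in CESU-8 or WTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a strict UTF-8 validator covering the four cases listed above. */
static bool utf8_is_valid(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t need;                   /* continuation bytes expected */
        uint8_t lo = 0x80, hi = 0xBF;  /* allowed range of the first one */
        if (b < 0x80) { i++; continue; }
        if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;  /* case 2: would be overlong */
            if (b == 0xED) hi = 0x9F;  /* case 3: U+D800..U+DFFF */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;  /* case 2: would be overlong */
            if (b == 0xF4) hi = 0x8F;  /* case 2: above U+10FFFF */
        } else {
            /* case 1: 0xC0, 0xC1, 0xF5..0xFF; case 4: stray 0x80..0xBF */
            return false;
        }
        if (len - i - 1 < need) return false;  /* case 4: truncated */
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}
```

The two always-invalid ranges from point 1 are 0xC0-0xC1 and 0xF5-0xFF. The restricted first-continuation ranges from point 2 are what rule out overlong forms, surrogates, and values above U+10FFFF.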
It is much more complicated than UTF-16. UTF-16 only has surrogates that are pretty straightforward.
UTF-16 is simple as well but you still need code to absorb BOMs, perform endian detection heuristically if there's no BOM, and check surrogate ordering (and emit a U+FFFD when an illegal pair is found).
I don't think there's an argument for either being complex, the UTFs are meant to be as simple and algorithmic as possible. -8 has to deal with invalid sequences, -16 has to deal with byte ordering, other than that it's bit shifting akin to base64. Normalization is much worse by comparison.
My preference for UTF-8 isn't one of code complexity, I just like that all my 70's-era text processing tools continue working without too many surprises. The features like self-synchronization are nice too compared to what we _could_ have gotten as UTF-8.
Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.
In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...
I thought it was normally six 6bit characters?
... However I'm not sure how much I trust it. It says that 5x7 was "the usual PDP-6/10 convention" and was called "five-seven ASCII", but I can't find the phrase "five-seven ASCII" anywhere on Google except for posts quoting that Wikipedia page. It cites two references, neither of which contain the phrase "five-seven ascii".
Though one of the references (RFC 114, for FTP) corroborates that PDP-10 could use 5x7:
[...] For example, if a
PDP-10 receives data types A, A1, AE, or A7, it can store the
ASCII characters five to a word (DEC-packed ASCII). If the
datatype is A8 or A9, it would store the characters four to a
word. Sixbit characters would be stored six to a word.
To me, it seems like 5x7 was one of multiple conventions you could store character data in a PDP-10 (and probably other 36-bit machines), and Wikipedia hallucinated that the name for this convention is "five-seven ASCII". (For niche topics like this, I sometimes see authors just stating their own personal terminology for things as a fact; be sure to check sources!)
[1] https://en.wikipedia.org/w/index.php?title=36-bit_computing&...
ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.
Morse code is variable length, so this made automatic telegraph machines or teletypes awkward to implement. The solution was the 5 bit Baudot code. Using a fixed length code simplified the devices. Operators could type Baudot code using one hand on a 5 key keyboard. Part of the code's design was to minimize operator fatigue.
Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.
Anyhow, the next change came when, instead of telegraph machines directly signaling on the wire, a typewriter was used to create a punched tape of codepoints, which would be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where stuff like "Carriage Return" and "Line Feed" originate. This got standardized by Western Union and internationally.
By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.
Technically, the punch card processing technology was patented by inventor Herman Hollerith in 1884, and the company he founded wouldn't become IBM until 40 years later (though it was folded with 3 other companies into the Computing-Tabulating-Recording company in 1911, which would then become IBM in 1924).
To be honest though, I'm not clear how ASCII came from anything used by the punch card sorting machines, since it wasn't proposed until 1961 (by an IBM engineer, but 32 years after Hollerith's death). Do you know where I can read more about the progression here?
> Work on the ASCII standard began in May 1961, when IBM engineer Bob Bemer submitted a proposal to the American Standards Association's (ASA) (now the American National Standards Institute or ANSI) X3.2 subcommittee.[7] The first edition of the standard was published in 1963,[8] contemporaneously with the introduction of the Teletype Model 33. It later underwent a major revision in 1967,[9][10] and several further revisions until 1986.[11] In contrast to earlier telegraph codes such as Baudot, ASCII was ordered for more convenient collation (especially alphabetical sorting of lists), and added controls for devices other than teleprinters.[11]
Beyond that I think you'd have to dig up the old technical reports.
> The base EBCDIC characters and control characters in UTF-EBCDIC are the same single byte codepoint as EBCDIC CCSID 1047 while all other characters are represented by multiple bytes where each byte is not one of the invariant EBCDIC characters. Therefore, legacy applications could simply ignore codepoints that are not recognized.
Dear god.
"The base ASCII characters and control characters in UTF-8 are the same single byte codepoint as ISO-8859-1 while all other characters are represented by multiple bytes where each byte is not one of the invariant ASCII characters. Therefore, legacy applications could simply ignore codepoints that are not recognized."
(I know nothing of EBCDIC, but this seems to mirror UTF-8 design)
This lives on in compose key sequences, so instead of a BS ' one types compose-' a and so on.
And this all predates ASCII: it's how people did accents and such on typewriters.
This is also why Spanish used to not use accents on capitals, and still allows capitals to not have accents: that would require smaller capitals, but typewriters back then didn't have them.
The accident of history is less that ASCII happens to be 7 bits, but that the relevant phase of computer development happened to primarily occur in an English-speaking country, and that English text happens to be well representable with 7-bit units.
This is easily proven by the success of all the ISO-8859-*, Windows and IBM CP-* encodings, and all the *SCII (ISCII, YUSCII...) extensions — they fit one or more languages in the upper 128 characters.
It's mostly CJK out of large languages that fail to fit within 128 characters as a whole (though there are smaller languages too).
Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just like technically there are a number of ASCII variants, with the common one just referred to as ASCII these days).
BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64, and for capital letters + numbers, you have 36, plus a few common punctuation marks puts you around 50. IIRC the original by IBM was around 45 or something. Slash, period, comma, etc.
So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't print more than 128 distinct characters anyway. There wasn't any ó or ö or anything like that printable, so why support it?
But eventually that yielded to 8-bit encodings (various ASCIIs like latin-1 extended, etc. that had ñ etc.).
Crucially, UTF-8 is only compatible with the 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.
IBM had standardized 8-bit bytes on their System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7-bits was weird, but characters didn't necessarily fit nicely into system words anyway.
It's not like 5-bit codes forgot about numbers and 80% of punctuation, or like 6-bit codes forgot about having upper and lower case letters. They were clearly 'insufficient' for general text even as the tradeoff was being made, it's just that each bit cost so much we did it anyway.
The obvious baseline by the time we were putting text into computers was to match a typewriter. That was easy to see coming. And the symbols on a typewriter take 7 bits to encode.
Typewriters have some statefulness, too, like "shift lock". Baudot needed to encode the actions of a typewriter to control it, not the output.
Crucially, "the 7-bit coded character set" is described on page 6 using only seven total bits (1-indexed, so don't get confused when you see b7 in the chart!).
There is an encoding mechanism to use 8 bits, but it's for storage on a type of magnetic tape, and even that still is silent on the 8th bit being repurposed. It's likely, given the lack of discussion about it, that it was for ergonomic or technical purposes related to the medium (8 is a power of 2) rather than for future extensibility.
So, it seems that ASCII was kept to 7 bits primarily so "extended ASCII" sets could exist, with additional characters for various purposes (such as other languages, but also for things like mathematical symbols).
https://hcs64.com/files/Mackenzie%20-%20Coded%20Character%20... sections 13.6 and 13.7
Looks to me like serendipity - they thought 8 bits would be wasteful, they didn't have a need for that many characters.
Coming at it naively, people might think the scope is something like "all sufficiently widespread distinct, discrete glyphs used by humans for communication that can be printed". But that's not true, because
* It's not discrete. Some code points are for combining with other code points.
* It's not distinct. Some glyphs can be written in multiple ways. Some glyphs which (almost?) always display the same, have different code points and meanings.
* It's not all printable. Control characters are in there - they pretty much had to be due to compatibility with ASCII, but they've added plenty of their own.
I'm not aware of any Unicode code points that are animated - at least what's printable, is printable on paper and not just on screen, there are no marquee or blink control characters, thank God. But, who knows when that invariant will fall too.
By the way, I know of one utf encoding the author didn't mention, utf-7. Like utf-8, but assuming that the last bit wasn't safe to use (apparently a sensible precaution over networks in the 80s). My boss managed to send me a mail encoded in utf-7 once, that's how I know what it is. I don't know how he managed to send it, though.
They should add separate code points for each variant and at least make it possible to avoid the problem in new documents. I've heard the arguments against this before, but the longer you wait, the worse the problem gets.
Chinese textbook: <ch>Chinese <jp>Mixed Japanese</jp> continue Chinese.</ch>
There is also UTF-9, from an April Fools RFC, meant for use on hosts with 36-bit words such as the PDP-10.
This isn't "scope creep". It's a reflection of reality. People were already constructing compositions like this in real life. The normalization problem was unavoidable.
https://research.swtch.com/utf8
And Rob Pike's description of the history of how it was designed:
Of course it's Pike and Thompson and the gang. The amount of contributions these guys made to the world of computing is insane.
So why not make the alternatives impossible by adding the start of the last valid option? So 11000000 10000001 would give codepoint 128+1 as values 0 to 127 are already covered by a 1 byte sequence.
The advantages are clear: No illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? The required addition being an unacceptable hardware cost at the time?
UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?
I suspect the answer is
a) the security impacts of overlong encodings were not contemplated; lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings
b) utf-8 as standardized allows for encode and decode with bitmask and bitshift only. Your proposed encoding requires bitmask and bitshift, in addition to addition and subtraction
You can find a bit of email discussion from 1992 here [1] ... at the very bottom there's some notes about what became utf-8:
> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.
The included FSS-UTF that's right before the note does include additive constants.
I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security was not that big of a thing then as it is now, so you're probably right.
I get what you mean, in terms of Postel's Law, e.g., software that is liberal in what it accepts should view 01001000 01100101 01101010 01101010 01101111 as equivalent to 11000001 10001000 11000001 10100101 11000001 10101010 11000001 10101010 11000001 10101111, despite the sequence not being byte-for-byte identical. I'm just not convinced Postel's Law should be applied wrt UTF-8 code units.
Yes, software shouldn’t accept overlong encodings, and I was pointing out another bad thing that can happen with software that does accept overlong encodings, thereby reinforcing the advice to not accept them.
It also notes that UTF-8 protects against the dangers of NUL and '/' appearing in filenames, which would kill C strings and DOS path handling, respectively.
In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you declared that you had to subtract the legal codepoints represented by shorter sequences, you'd have to introduce additional arithmetic operations in encoding and decoding.
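To make the trade-off concrete, here is a sketch of the two-byte case both ways. `decode2_biased` is the hypothetical scheme proposed upthread, not anything standardized, and both functions assume the caller has already checked the `110xxxxx 10xxxxxx` shape.

```c
#include <stdint.h>

/* Standard UTF-8 two-byte decode: mask and shift only. Lead bytes 0xC0 and
 * 0xC1 produce values below 0x80 (overlong forms), which a conforming
 * decoder has to reject separately. */
static uint32_t decode2_standard(uint8_t b0, uint8_t b1) {
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* Hypothetical biased variant: add the number of code points already
 * covered by one-byte sequences, so every bit pattern is a distinct,
 * non-overlong code point. Costs one extra addition per decode. */
static uint32_t decode2_biased(uint8_t b0, uint8_t b1) {
    return decode2_standard(b0, b1) + 0x80;
}
```

In the biased scheme `C0 81` decodes to 129, matching the "128+1" example above, while in real UTF-8 U+0080 is `C2 80` and `C0 81` is simply invalid.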
It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF, which fits in 21 bits.
I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger code points in UTF-8.
Or utf-16 is officially considered a second class citizen, and some code points are simply out of its reach.
Even with all Chinese characters, de-unified, all the notable historical and constructed scripts, technical symbols, and all the submitted emoji, including rejections, you are still way short of a million.
We will probably never need more than 21 bits unless we start stretching the definition of what text is.
The exact number is 1112064 = 2^16 - 2048 + 16*2^16: in UTF-16, 2 bytes can encode 2^16 - 2048 code points, and 4 bytes can encode 16*2^16 (the 2048 surrogates are not counted because they can never appear by themselves, they're used purely for UTF-16 encoding).
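The count is easy to double-check. The variable names below are only for illustration.

```python
bmp_without_surrogates = 2**16 - 2048   # what a single 16-bit unit can encode
supplementary = 16 * 2**16              # what a surrogate pair can encode
total = bmp_without_surrogates + supplementary

print(total)                            # 1112064
# Same number seen from the other side: U+0000..U+10FFFF minus the surrogates.
print(total == 0x110000 - 2048)         # True
```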
Yes, it is 'truncated' to the "UTF-16 accessible range":
* https://datatracker.ietf.org/doc/html/rfc3629#section-3
* https://en.wikipedia.org/wiki/UTF-8#History
Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:
Edit: just tested this, Perl still allows this, but with an extra twist: v-notation goes up to 2^63-1. From 2^31 to 2^36-1 is encoded as FE + 6 bytes, and everything above that is encoded as FF + 12 bytes; the largest value it allows is v9223372036854775807, which is encoded as FF 80 87 BF BF BF BF BF BF BF BF BF BF. It probably doesn't allow that one extra bit because v-notation doesn't work with negative integers.
If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.
No, UTF-8's design can encode up to 31 bits of codepoints. The limitation to 21 bits comes from UTF-16, which was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (well, compatibility will be a problem).
In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.
It's less fun when you have things that need to keep working break because someone felt like renaming a parameter, or that a part of the standard library looks "untidy"
Honestly, Python is probably one of the worst offenders here: it happily combines breaking changes made for low-value rearranging of deck chairs with a dynamic language where you might only find out at runtime.
The fact that they've also decided to use an unconventional interpretation of minor versions shows how little they care.
There were apps that completely rejected non-7-bit data back in the day. Backwards compatibility wasn't the only point. The point of UTF-8 is more (IMO) that UTF-32 is too bulky, UCS-2 was insufficient, UTF-16 was an abortion, and only UTF-8 could have the right trade-offs.
Would be great if it was possible to enter codepoints directly; you can do it via the URL (`/F8FF` eg), but not in the UI. (Edit, the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)
https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
So I went around fixing UnicodeErrors in Python at random, for years, despite knowing all that stuff. It wasn't until I read Batchelder's piece on the "Unicode Sandwich," about a decade later that I finally learned how to write a program to support it properly, rather than playing whack-a-mole.
Is this the piece you mean? https://nedbatchelder.com/text/unipain.html
The only problem with UTF-8 is that Windows and Java were developed without knowledge about UTF-8 and ended up with 16-bit characters.
Oh yes, and Python 3 should have known better when it went through the string-bytes split.
As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.
So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.
UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.
However, it's not used widely and has problems with variant-naïve fonts.
This week's Unicode 17 announcement [1] mentions that of the ~160k existing codepoints, over 100k are CJK codepoints, so I don't think this can be true...
[1] https://blog.unicode.org/2025/09/unicode-170-release-announc...
But what if instead of emojis we take the CJK set and make it more compositional? Instead of >100k characters with different glyphs we could have defined a number of brush stroke characters and compositional characters (like "three of the previous character in a triangle formation"). We could still make distinct code points for the most common couple thousand characters, just like ä can be encoded as one code point or two (umlaut dots plus a).
Alas, in the 90s this would have been seen as too much complexity
Ideographic Description Characters: https://www.unicode.org/charts/PDF/U2FF0.pdf
The fine people over at Wenlin actually have a renderer that generates characters based on this sort of programmatic definition, their Character Description Language: https://guide.wenlininstitute.org/wenlin4.3/Character_Descri... ... in many cases, they are the first digital renderer for new characters that don't yet have font support.
Another interesting bit, the Cantonese linguist community I regularly interface with generally doesn't mind unification. It's treated the same as a "single-storey a" (the one you write by hand) and a "two-storey a" (the one in this font). Sinitic languages fractured into families in part because the graphemes don't explicitly encode the phonetics + physical distance, and the graphemes themselves fractured because somebody's uncle had terrible handwriting.
I'm in Hong Kong, so we use 説 (8AAC, normalized to 8AAA) while Taiwan would use 說 (8AAA). This is a case my linguist friends consider a mistake, but it happened early enough that it was only retroactively normalized. Same word, same meaning, grapheme distinct by regional divergence. (I think we actually have three codepoints that normalize to 8AAA because of radical variations.)
The argument basically reduces to "should we encode distinct graphemes, or distinct meanings?" Unicode has never been fully consistent on either side of that. The latest example: we're getting ready to do Seal Script as a separate non-unified code point. https://www.unicode.org/roadmaps/tip/
In Hong Kong, some old government files just don't work unless you have the font that has the specific author's Private Use Area mapping (or happen to know the source encoding and can re-encode it). I've regularly had to pull up old Windows in a VM to grab data about old code pages.
In short: it's a beautiful mess.
The grand crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification, where we already had a whopping 1.1 million code points at our disposal.
I'm not sure what you mean by this. The UTF-8 specification was written long before emoji were included in Unicode, and generally has no bearing on what characters it's used to encode.
https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...
Even for identifiers you probably want to do all kinds of normalization even beyond the level of UTF-8 so things like overlong sequences and other errors are really not an inherent security issue.
Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences by replacing them with the U+FFFD ("replacement character"). You'll see it used (for example) in browsers all the time.
Mandating acceptance for every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be automatically detected after the fact.
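Both behaviours are available side by side in Python's codec machinery, which makes for a handy illustration. The sample bytes are made up for this example (the word "café" encoded as Latin-1, which is not valid UTF-8).

```python
data = b"caf\xe9"  # Latin-1 bytes; 0xE9 is a lead byte with nothing after it

# Lenient mode: the bad byte becomes U+FFFD and decoding carries on.
lenient = data.decode("utf-8", errors="replace")
print(lenient == "caf\ufffd")  # True

# Strict mode: the error is reported instead of being papered over.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("rejected:", exc.reason)
```

Once the replacement has happened there is no way to tell from the output which byte was originally there, which is the "can't be automatically detected after the fact" problem.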
This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.
There is right now a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc that allows them to read and output JSON with invalid UTF-8 strings representing binary data in a way that round-trips. I'm not for making that the default for jq, and you have to be very careful about this to make sure that all the tools you use to handle such "JSON" round-trip the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools also do the same.
UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!
A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese so while many of the byte values mapped to valid characters it was nonsensical.
UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
In other words, yes, it's backward compatible, but UTF-8 is also compact and elegant even without that.
https://github.com/ParkMyCar/compact_str
How cool is that
(Discussed here https://news.ycombinator.com/item?id=41339224)
> how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?
> To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192)
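The property in that quote is easy to poke at from Python. This is only a spot check on a few sample strings chosen for the example, not a proof, and `last_byte` is a made-up helper name.

```python
def last_byte(s: str):
    """Return the final byte of the UTF-8 encoding of s, or None if s is empty."""
    encoded = s.encode("utf-8")
    return encoded[-1] if encoded else None

samples = ["hello", "naïve", "日本語", "🙂"]
for s in samples:
    b = last_byte(s)
    # The final byte is either ASCII (0..127) or a continuation byte (128..191).
    print(repr(s), hex(b), b < 192)
```

Since values 192 through 255 can never be the last byte of valid UTF-8, they are free to act as a tag in that position.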
UTF-32 has an entire spare byte to put flags into. 24 or 21 bit encodings have spare bits that could act as flags. UTF-16 has plenty of invalid code units, or you could use a high surrogate in the last 2 bytes as your flag.
Edit: see https://raw.githubusercontent.com/tsutsui/emacs-18.59-netbsd...
I don't know if you have ever had to use White-Out to correct typing errors on a typewriter that lacked the ability natively, but before White-Out, the only option was to start typing the letter again, from the beginning.
0x7f was White-Out for punched paper tape: it allowed you to strike out an incorrectly punched character so that the message, when it was sent, would print correctly. ASCII inherited it from the Baudot–Murray code.
It's been obsolete since people started punching their tapes on computers instead of Teletypes and Flexowriters, so around 01975, and maybe before; I don't know if there was a paper-tape equivalent of a duplicating keypunch, but that would seem to eliminate the need for the delete character. Certainly TECO and cheap microcomputers did.
This means that frame numbers in a FLAC file can go up to 2^36-1, so a FLAC file can have up to 68,719,476,735 frames. If it was recorded at a 48kHz sample rate, there will be 48,000 frames per second, meaning a FLAC file at 48kHz sample rate can (in theory) be about 1.43 million seconds long, or roughly 16.6 days long.
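The arithmetic can be checked directly. This takes the comment's premise of 48,000 frames per second at face value; the variable names are only for illustration.

```python
max_frames = 2**36 - 1              # largest 36-bit frame number
frames_per_second = 48_000          # premise taken from the comment above

seconds = max_frames / frames_per_second
days = seconds / 86_400             # 86,400 seconds in a day

print(f"{max_frames:,}")            # 68,719,476,735
print(round(seconds))               # 1431656
print(round(days, 1))               # 16.6
```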
So if Unicode ever needs to encode 68.7 billion characters, well, extended seven-byte UTF-8 will be ready and waiting. :-D
NOW, in hindsight, it makes more sense to use UTF-8, but 20 years ago it wasn't clear that it was worth it.
Once enough people accepted that this approach was impractical, UCS-2 was replaced with UTF-16 and surrogate codes. At that point it was clear that UTF-8 was better in almost every scenario because neither had an advantage for random access and UTF-8 was usually substantially smaller.
Storage-wise, UTF-8 is usually better since so much data is ASCII with maybe the occasional accented character. The speed issue only really matters to Windows NT since that was UCS-2 inside, but it wasn't a problem for many.
So, it won't fill up during our lifetime I guess.
Imagine the code points we'll need to represent an alien culture :).
If we ever needed that many characters, yes the most obvious solution would be a fifth byte. The standard would need to be explicitly extended though.
But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So seems like it would be a pretty cool problem to have.
So what would need to happen first would be that unicode decides they are going to include larger codepoints. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)
It seems like Unicode code points are less than 30% allocated, roughly. So there's 70% free space.
---
Think of these three separate concepts to make it clear. We are effectively dealing with two translations - one from the abstract symbol to defined unicode code point. Then from that code point we use UTF-8 to encode it into bytes.
1. The glyph or symbol ("A")
2. The unicode code point for the symbol (U+0041 Latin Capital Letter A)
3. The utf-8 encoding of the code point, as bytes (0x41)
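The same three layers, spelled out in Python as a small sketch. The helper name `layers` and the second sample character (the euro sign) are additions for illustration.

```python
def layers(symbol: str):
    """Return (symbol, code point label, UTF-8 bytes in hex) for one character."""
    code_point = ord(symbol)                 # step 1 -> 2: symbol to code point
    encoded = symbol.encode("utf-8")         # step 2 -> 3: code point to bytes
    return symbol, f"U+{code_point:04X}", encoded.hex(" ")

print(layers("A"))   # ('A', 'U+0041', '41')
print(layers("€"))   # ('€', 'U+20AC', 'e2 82 ac')
```

For "A" the code point and the encoded byte happen to have the same value, which is what makes the distinction easy to miss; the euro sign shows the two layers coming apart.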
Because if so: I don't really like that. It would mean that "equal sequence of code points" does not imply "equal sequence of encoded bytes" (the converse continues to hold, of course), while offering no advantage that I can see.
I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode larger numbers representing Unicode" is not that much of a challenge, and the space of practical solutions isn't even all that large.
UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.
More importantly, that file has the same meaning. Same with the converse.
Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters)
The same can happen to a plain ASCII file (without a BOM): once you edit it, and you add, say, some accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.
So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
bash: line 1: #!/bin/bash: No such file or directory
If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore it was not interpreted as a Bash comment) that didn't matter to the actual script logic.
This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints U+007F or below can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that looks for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).
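You can see the three offending bytes without leaving Python. This sketch uses the standard `utf-8-sig` codec, which writes a BOM on encode; the script text is a made-up example.

```python
import codecs

script = "#!/bin/bash\necho hi\n"

without_bom = script.encode("utf-8")
with_bom = script.encode("utf-8-sig")        # same text, BOM prepended

print(without_bom[:2])                       # b'#!'
print(with_bom[:5])                          # b'\xef\xbb\xbf#!'
print(with_bom[:3] == codecs.BOM_UTF8)       # True

# A byte-level check for a shebang, like the kernel's, fails on the BOM version.
print(without_bom.startswith(b"#!"))         # True
print(with_bom.startswith(b"#!"))            # False
```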
What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that make another encoding sensible, reparsing it as something else (like Windows-1252) might be sensible.
But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.
The Amiga always used all 8 bits (ISO-8859-1 by default), so detecting UTF-8 without a BOM is not so easy, especially when you start with an empty file, or in some scenario like the other one I mentioned.
And it's not that Macs and PCs don't have 8-bit legacy or coexistence needs. What you seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important.
Since we now have UTF-8 files with BOMs that need to be handled anyway, would it not be better if all the "Unicode-unaware" apps at least supported the BOM (stripping it, in the simplest case)?
What that question means is that the Unicode-unaware apps would have to become Unicode-aware, i.e. be rewritten. And that would entirely defeat the purpose of backwards-compatibility with ASCII, which is the fact that you don't have to rewrite 30-year-old apps.
With UTF-16, the byte-order mark is necessary so that you can tell whether uppercase A will be encoded 00 41 or 41 00. With UTF-8, uppercase A will always be encoded 41 (hex, or 65 decimal) so the byte-order mark serves no purpose except to signal "This is a UTF-8 file". In an environment where ISO-8859-1 is ubiquitous, such as the Web fifteen years ago, the signal "Hey, this is a UTF-8 file, not ISO-8859-1" was useful, and its drawbacks (BOM messing up certain ASCII-era software which read it as a real character, or three characters, and gave a syntax error) cost less than the benefits. But now that more than 99% of files you'll encounter on the Web are UTF-8, that signal is useful less than 1% of the time, and so the costs of the BOM are now more expensive than the benefits (in fact, by now they are a lot more expensive than the benefits).
As you can see from the paragraph above, you're not reading me quite right when you say that I "seem to be saying is that compatibility with 7-bit ASCII is sacred, whereas compatibility with 8-bit text encodings is not important". Compatibility with 8-bit text encodings WAS important, precisely because they were ubiquitous. It IS no longer important in a Web context, for two reasons. First, because they are less than 1% of documents and in the contexts where they do appear, there are ways (like HTTP Content-Type headers or HTML charset meta tags) to inform parsers of what the encoding is. And second, because UTF-8 is stricter than those other character sets and thus should be parsed first.
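The byte-order point is quick to demonstrate with Python's explicit-endianness codecs (a small sketch, nothing more):

```python
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")
utf8 = "A".encode("utf-8")

print(big_endian.hex(" "))      # 00 41
print(little_endian.hex(" "))   # 41 00
print(utf8.hex(" "))            # 41
```

UTF-16 has two possible byte sequences for the same character, so something has to say which one is in use. UTF-8 has exactly one, so there is no byte order to mark.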
Let me explain that last point, because it's important in a context like Amiga, where (as I understand you to be saying) ISO-8859-1 documents are still prevalent. If you have a document that is actually UTF-8, but you read it as ISO-8859-1, it is 100% guaranteed to parse without the parser throwing any "this encoding is not valid" errors, BUT there will be mistakes. For example, å will show up as Ã¥ instead of the å it should have been, because å (U+00E5) encodes in UTF-8 as 0xC3 0xA5. In ISO-8859-1, 0xC3 is Ã and 0xA5 is ¥. Or ç (U+00E7), which encodes in UTF-8 as 0xC3 0xA7, will show up in ISO-8859-1 as Ã§ because 0xA7 is §.
(As an aside, I've seen a lot of UTF-8 files incorrectly parsed as Latin-1 / ISO-8859-1 in my career. By now, if I see Ã followed by at least one other accented Latin letter, I immediately reach for my "decode this as Latin-1 and re-encode it as UTF-8" Python script without any further investigation of the file, because that Ã, 0xC3, is such a huge clue. It's already rare in European languages, and the chances of it being followed by ¥ or § or indeed any other accented character in any real legacy document are so vanishingly small as to be nearly non-existent. This comment, where I'm explicitly citing it as an example of misparsing, is actually the only kind of document where I would ever expect to see the sequence Ã§ as being what the author actually intended to write).
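A repair script of that kind can be very small. The sketch below is a guess at the general shape, not the commenter's actual script; `repair_mojibake` is a hypothetical name, and it assumes the text was UTF-8 that got decoded as Latin-1 exactly once.

```python
def repair_mojibake(garbled: str) -> str:
    """Undo one round of 'UTF-8 bytes read as Latin-1'.

    Raises UnicodeError if the text does not fit that pattern.
    """
    # Turn the wrongly decoded characters back into the original bytes,
    # then decode those bytes the way they were meant to be decoded.
    return garbled.encode("latin-1").decode("utf-8")

original = "å ç"
garbled = original.encode("utf-8").decode("latin-1")

print(garbled)                    # Ã¥ Ã§
print(repair_mojibake(garbled))   # å ç
```

Because the final step is a strict UTF-8 decode, text that was never mojibake in the first place will usually raise an error rather than being silently mangled.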
Okay, so we've established that a file that is really UTF-8, but gets incorrectly parsed as ISO-8859-1, will NOT cause the parser to throw out any error messages, but WILL produce incorrect results. But what about the other way around? What about a file that's really ISO-8859-1, but that you incorrectly try to parse as UTF-8? Well, NEARLY all of the time, the ISO-8859-1 accented characters found in that file will NOT form a correct UTF-8 sequence. In 99.99% (and I'm guessing you could end up with two or three more nines in there) of actual ISO-8859-1 files designed for human communication (as opposed to files deliberately designed to be misparsed), you won't end up with a combination of accented Latin characters that just happen to match a valid UTF-8 sequence, and it's basically impossible for ALL the accents in an ISO-8859-1 document to just so happen to be valid UTF-8 sequences. In theory it could happen, but your chances of being struck by a 10-kg meteorite while sitting at your computer are better than of that happening by chance. (Again, I'm excluding documents deliberately designed with malice aforethought, because that's not the main scenario here). Which means that if you parse that unknown file as UTF-8 and it wasn't UTF-8, your parser will throw out an error message.
So when you encounter an unknown file, that has a 90% chance of being ISO-8859-1 and a 10% chance of being UTF-8, you might think "Then I should try parsing it in ISO-8859-1 first, since that has a 90% chance of being right, and if it looks garbled then I'll reparse it". But "if it looks garbled" needs human judgment. There's a better way. Parse it in UTF-8 first, in strict mode where ANY encoding error makes the entire parse be rejected. Then if the parse is rejected, re-parse it in ISO-8859-1. If the UTF-8 parser parses it without error, then either it was an ISO-8859-1 file with no accents at all (all characters 0x7F or below, so that the UTF-8 encoding and the ISO-8859-1 encoding are identical and therefore the file was correctly parsed), or else it was actually a UTF-8 file and it was correctly parsed. If the UTF-8 parser rejects the file as having invalid byte sequences, then parse it as the 8-bit encoding that is most likely in your context (for you that would be ISO-8859-1, for the guy in Japan who commented it would likely be Shift-JIS that he should try next, and so on).
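In Python the whole strategy fits in a few lines. This is a minimal sketch; `decode_unknown` and its `fallback` parameter are names invented for the example, and the fallback should be whatever 8-bit encoding is most likely in your context.

```python
def decode_unknown(data: bytes, fallback: str = "iso-8859-1"):
    """Try strict UTF-8 first; only if that fails, use the legacy encoding.

    Returns a (text, encoding_used) pair.
    """
    try:
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback

print(decode_unknown("blåbær".encode("utf-8")))       # ('blåbær', 'utf-8')
print(decode_unknown("blåbær".encode("iso-8859-1")))  # ('blåbær', 'iso-8859-1')
print(decode_unknown(b"plain ascii"))                 # ('plain ascii', 'utf-8')
```

No human judgment about "does it look garbled" is needed: the strict UTF-8 decoder does the detection.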
That logic is going to work nearly 100% of the time, so close to 100% that if you find a file it fails on, you had better odds of winning the lottery. And that logic does not require a byte-order mark; it just requires realizing that UTF-8 is a rather strict encoding with a high chance of failing if it's asked to parse files that are actually from a different legacy 8-bit encoding. And that is, in fact, one of UTF-8's strengths (one guy elsewhere in this discussion thought that was a weakness of UTF-8) precisely because it means it's safe to try UTF-8 decoding first if you have an unknown file where nobody has told you the encoding. (E.g., you don't have any HTTP headers, HTML meta tags, or XML preambles to help you).
NOW. Having said ALL that, if you are dealing with legacy software that you can't change which is expecting to default to ISO-8859-1 encoding in the absence of anything else, then the UTF-8 BOM is still useful in that specific context. And you, in particular, sound like that's the case for you. So go ahead and use a UTF-8 BOM; it won't hurt in most cases, and it will actually help you. But MOST of the world is not in your situation; for MOST of the world, the UTF-8 BOM causes more problems than it solves. Which is why the default for ALL new software should be to try parsing UTF-8 first if you don't know what the encoding is, and try other encodings only if the UTF-8 parse fails. And when writing a file, it should always be UTF-8 without BOM unless the user explicitly requests something else.
I'm also saying that apps should not create a BOM header any more (in UTF-8 only, not in UTF-16 where it's required), because the costs of dealing with BOM headers are higher than they're worth. Except in certain specific circumstances, like having to deal with pre-Unicode apps that default to assuming 8-bit encodings.
https://x.com/jbogard/status/1111328911609217025
If that link doesn't work, then try:
https://xcancel.com/jbogard/status/1111328911609217025
Source (which will explain the joke for anyone who didn't get it immediately):
https://www.jimmybogard.com/the-curious-case-of-the-json-bom...
Not ALL of the 20th-century Internet has bit-rotted and fallen apart yet. (Just most of it).
The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.
You _do_ need a BOM for UTF-16 and UTF-32.
I also agree that "BOM" is the wrong name for a UTF-8... BOM. Byte order is not the issue. But still, it's a header that says that the file, even if empty, is UTF-8. Detecting an 8-bit legacy character set is much more difficult than recognizing (skipping) a BOM.
- first, byte order doesn't affect the UTF-8 encoding,
- second, the codeset metadata problem you're trying to solve is a problem that already existed before and still does after UTF-8 enters the scene -- you just have to know if some text file (or whatever) uses UTF-8, ISO 8859-x, SHIFT-JIS, UTF-16, etc.
The second point addresses your concern, but that metadata has to be out of band. Putting it in-band creates the sorts of problems that others have pointed out, and it creates an annoyance once all non-Unicode locales are gone. And since the goal is to have Unicode replace all other codesets, and since we've made a great deal of progress in that direction, there is no need now to add this wart.
In a future where everything defaults to UTF-8 it makes sense. This is probably easier to envision in an English-only context where the jump from 7-bit ASCII to UTF-8 is cleaner.
Where I come from, UTF-8 is not always supported. Without a header (or "BOM", though we don't like the name) you don't know in what encoding a text file was meant to be (re-)saved as when it was created. My example of an empty file was meant to illustrate that. But leaning on the Utopian side, I too shall put more energy towards all apps supporting UTF-8 :)
Yeah, UTF-8 by default -or better, as the only option- is the dream.
Keep in mind that if you do use a BOM for UTF-16 then it's possible to reliably tell that some file is in UTF-8.
Most other standards just do the xkcd thing: "now there's 15 competing standards"
Is that all Unicode can do? How are they going to fit all the emojis in?
297334 codepoints have been assigned so far; that's about 1/4 of the available range, if my napkin math is right. Plenty of room for more emoji.
The network addresses aren't variable length, so if you decide "Oh IPv6 is variable length" then you're just making it worse with no meaningful benefit.
The IPv4 address is 32 bits, the IPv6 address is 128 bits. You could go 64 but it's much less clear how to efficiently partition this and not regret whatever choices you do make in the foreseeable future. The extra space meant IPv6 didn't ever have those regrets.
It suits a certain kind of person to always pay $10M to avoid the one-time $50M upgrade cost. They can do this over a dozen jobs in twenty years, spending $200M to avoid $50M cost and be proud of saving money.
> Show the character represented by the remaiing 7 bits on the screen.
I notice there is a typo.
† With an occasional UNICODE flourish.
ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.
UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.
Surrogate pairs aren’t applicable to UTF-8. That range of code points (U+D800 to U+DFFF) is simply invalid in UTF-8 and should be treated as such (a parsing error, invalid characters, etc.).
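Python's strict UTF-8 codec behaves this way in both directions, which makes for a quick check. The helper names `can_encode` and `can_decode` are made up for this sketch.

```python
def can_encode(text: str) -> bool:
    """True if the text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

def can_decode(data: bytes) -> bool:
    """True if the bytes are valid strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(can_encode("\ud800"))          # False: a lone surrogate is not encodable
print(can_decode(b"\xed\xa0\x80"))   # False: the bytes that would spell U+D800
print(can_decode("🙂".encode("utf-8")))  # True: encoded directly as 4 bytes
```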
Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.
I could ask Gemini but HN seems more knowledgeable.
Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.
There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.
Java's mistake seems to have been independent and it seems mainly to have been motivated by the mistaken idea that it was necessary to index directly into strings. That would have been deprecated fast if Windows had been UTF-8 friendly and very fast if it had been UTF-16 hostile.
We can always dream.
There really was a time when UTF-16 (or rather UCS2) made sense.
What about UTF-7? That seemed like a bad idea even at the time.
It's also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (which is the UTF-8 limit).