A lot of coders, especially those who have worked primarily in English-speaking countries, treat ASCII and UTF-8 as the same thing, because for their text the difference is invisible. They can go decades oblivious to topics like encodings, mappings, and display.
So it can be surprising to them when they start dealing with European characters for the first time. They view the text in one place (like an editor that treats the file as UTF-8) and then in another (their program, which treats the text as ASCII), and the two disagree.
It's hard to explain to them that "when I look at it" isn't a universal truth; it also matters how the "look at it" program chooses to interpret and display it.
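To make that concrete, here is a minimal Python sketch of the same bytes being "looked at" three ways. The sample name and the variable names are just illustrations, not anything from a real program.

```python
# The same bytes, interpreted by three different "look at it" programs.
raw = "M\u00fcller".encode("utf-8")    # b'M\xc3\xbcller' ("Müller" in UTF-8)

as_utf8 = raw.decode("utf-8")          # 'Müller'
as_latin1 = raw.decode("latin-1")      # 'MÃ¼ller', the classic mojibake

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False                   # strict ASCII rejects any byte >= 0x80

# Pure ASCII text is byte-identical in both encodings,
# which is why the difference stays invisible for English-only text.
plain_same = "Muller".encode("ascii") == "Muller".encode("utf-8")
```

None of the three decodings is "what the file really says"; each is one program's interpretation of the bytes.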
Something assumptions and asses.
Only to have obvious junior devs throw all of that out the Windows and now most things are dictated by the user's region. Yes, I live in Germany and the region is important for region-locked shit on the app store. That does not fucking mean I want my OS to talk to me in German.
And on M365, I want German date formatting (DD.MM.YYYY), but when M365 is in English, you cannot select that date formatting, because someone thought that Americans would never need that.
Fuck all of these ignorant bastards!
Surely you mean that sorting correctly is impossible without Unicode? Otherwise you would have to hardcode the rules of sorting strings correctly in my language (and all other languages) yourself.
Unless your preferred solution is "close my eyes and pretend non-ASCII characters don't exist", then... I'm not a fan.
Case conversion is similar except the default rules do a very good job in general. But still, there are a few language-specific quirks and, again, you do have to know what language is involved to get those right.
I'm agreeing with you, to be clear, just adding that Unicode isn't always enough, but it does a decent job if you don't know the language in advance, and it defines the correct rules if you do.
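A small Python sketch of the case-conversion point. The built-in `str.upper()` and `str.lower()` apply the language-independent default mappings; `lower_turkish` is a hypothetical helper written only for this example, and it covers just the dotted/dotless i rule.

```python
# Default (language-independent) case mappings as Python applies them.
upper_eszett = "stra\u00dfe".upper()   # 'STRASSE': one letter becomes two
default_lower_i = "I".lower()          # 'i', which is wrong for Turkish text
dotted_lower = "\u0130".lower()        # 'i' + U+0307 (combining dot above)


def lower_turkish(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish i rules.

    In Turkish, capital I lowercases to dotless U+0131 and capital
    U+0130 (dotted I) lowercases to plain 'i'. Everything else falls
    through to the default mapping.
    """
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()
```

The defaults get German right without being told the language; the Turkish case only comes out right once you know the text is Turkish.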
This isn’t necessarily true beyond ASCII, and it depends entirely on the collation [0]. One need only spend some time peering into the abyss that is RDBMS collation support [1] [2] to see the horror.
[0]: http://www.unicode.org/reports/tr10/
[1]: https://dev.mysql.com/doc/refman/8.4/en/charset-unicode-sets...
How useful this is depends on the language, of course. I did say that as well. But Unicode was put together from legacy character encodings, and did what it could to preserve the order of those, so it's far from useless.
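For illustration, a Python sketch of what plain code point order does to a few words, next to a deliberately crude sort key. `crude_key` is an assumption made up for this example: it strips combining marks and case, and it is not the Unicode Collation Algorithm or any language's real rules.

```python
import unicodedata

WORDS = ["zebra", "\u00c4pfel", "apple", "Zucker"]   # "\u00c4pfel" is "Äpfel"

# Plain code point order: uppercase before lowercase, 'Ä' after 'z'.
by_code_point = sorted(WORDS)


def crude_key(word: str) -> str:
    """Rough approximation only: drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


by_crude_key = sorted(WORDS, key=crude_key)
```

`by_code_point` comes out as Zucker, apple, zebra, Äpfel, while `by_crude_key` gives Äpfel, apple, zebra, Zucker. The second looks reasonable to a German reader but is wrong for Swedish, where ä sorts after z, which is the "you have to know the language" part.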
In the end, what truly matters is whether the codebase is consistent—either using tabs or spaces throughout
I use tabs for code indentation, but spaces for non-code indentation (e.g. for ASCII diagrams within comments).
Anyone who has converted a lot of code, from different projects, from spaces to tabs will have noticed: the vast majority of code with spaces contains a few screwups where a line or two in a 4-spaced file actually contains 3 spaces.
Why that happens, despite editors automatically converting tabs to spaces, is beyond me, but it is a ubiquitous phenomenon. I suspect this is the real reason some people, certainly myself, prefer tabs.
Screwups like missing or adding a space can happen easily even with auto-indent, a common cause is splitting or merging a line, i.e. changing a space with a newline and vice versa. That space character has a tendency to end up where you don't want it, or conversely, get eaten up. When using tabs, invisible space characters can end up between tabs.
In the end, on collaborative projects, I usually settle on 4 space indentation, as it is the most common and from my experience, the least likely for people to screw up.
In my experience it's usually from copy-paste, usually because the cursor wasn't at the right position when pasting. The cursor not being at the right position because you deleted some spaces to reduce the level of indentation before pasting, but didn't do the right number. While tab inserts the right number of spaces, delete still deletes spaces.
Also occasionally due to a find-replace that accidentally included a leading space, which can be hard to see when the find/replace boxes are in a proportional font.
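The stray-space screwups described in this subthread are easy to detect mechanically. A minimal Python sketch, assuming a fixed indent width; the function name and the reported messages are made up for this example.

```python
def find_indent_problems(source: str, width: int = 4) -> list[tuple[int, str]]:
    """Report lines whose leading whitespace looks accidental.

    Flags indentation that mixes tabs and spaces, and space-only
    indentation that is not a multiple of `width`.
    """
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no meaningful indentation
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in indent and "\t" in indent:
            problems.append((lineno, "mixed tabs and spaces"))
        elif "\t" not in indent and len(indent) % width != 0:
            problems.append(
                (lineno, f"{len(indent)} spaces, not a multiple of {width}")
            )
    return problems
```

This will also flag continuation lines that are deliberately aligned under an opening parenthesis, so treat the output as a list of suspects rather than errors.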
I know not everyone will agree with me, but I think defining whitespace in a language as essentially [ \t\n] between any token is a language design mistake.
I would prefer to have the spacing version of this problem, personally, because that way I can always see that there's a problem, and can do so without resorting to changing tab widths or making invisible characters visible.
I personally use autoformatter in all CI pipelines, and error out for every change. This entirely kills the whole issue of wrong indentation/dangling spaces/accidental tabs/inconsistent formatting, etc.
As an aside, what tools do you use to produce the diagrams?
Contrary to the way I worded my comment, my 'diagrams' are typically no more than text with perhaps a box-drawing unicode character here or there. But even drawing a simple tree, tabs can mess up important details.
> Stop hard-wrapping and just use soft-wrapping,
Grep for some pattern in soft-wrapped text and you get a lot of extraneous material.
You also can't grep for things "at the beginning of the line", which is often an important indicator. When I did a lot of plain C programming, I would put function names at the start of a line, below their return type to make it easy to grep for a function definition, rather than just uses.
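A sketch of that convention in Python terms; the C snippet and the function names are invented for the example. Anchoring the pattern at the start of a line finds the definition only, much like `grep -n '^parse_header('` would.

```python
import re

C_SOURCE = """\
static int
parse_header(const char *buf)
{
    return 0;
}

int
main(void)
{
    return parse_header("x");
}
"""


def _line_numbers(pattern: re.Pattern, source: str) -> list[int]:
    # Convert each match offset into a 1-based line number.
    return [source.count("\n", 0, m.start()) + 1 for m in pattern.finditer(source)]


def find_definitions(source: str, name: str) -> list[int]:
    """Lines where `name(` appears at the very start of a line."""
    return _line_numbers(re.compile(rf"^{re.escape(name)}\(", re.MULTILINE), source)


def find_all_mentions(source: str, name: str) -> list[int]:
    """Lines where `name(` appears anywhere: definitions and calls."""
    return _line_numbers(re.compile(rf"\b{re.escape(name)}\("), source)
```

Here `find_definitions(C_SOURCE, "parse_header")` reports line 2 only, while `find_all_mentions` reports lines 2 and 10.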
Soft-wrapping also limits the use of diffability, a complement to grepability. You might correct a single letter in a misspelled word in a soft-wrapped paragraph. Do a "git diff" or equivalent and you'll get back a huge block of "changed" text. Useless. Short, hard wrapped lines make it easy to see diffs.
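A rough way to measure that claim with Python's `difflib`, which does a line-based comparison like a plain `git diff`. The paragraph text, the 40-column width, and the helper names are assumptions chosen for the example.

```python
import difflib
import textwrap

PARAGRAPH = (
    "Soft-wrapped text keeps each paragraph on one long line. "
    "A one-letter fix therefor touches the whole paragraph in a line diff. "
    "Hard-wrapped text confines the same fix to one short line."
)


def removed_chars(old_lines: list[str], new_lines: list[str]) -> int:
    """Count characters on lines a line-based diff reports as removed."""
    return sum(
        len(line) - 2                      # drop the "- " marker
        for line in difflib.ndiff(old_lines, new_lines)
        if line.startswith("- ")
    )


def fix_typo(line: str) -> str:
    return line.replace("therefor", "therefore")


soft_old = [PARAGRAPH]                              # one line per paragraph
hard_old = textwrap.wrap(PARAGRAPH, width=40)       # hard-wrapped at 40 columns

soft_cost = removed_chars(soft_old, [fix_typo(l) for l in soft_old])
hard_cost = removed_chars(hard_old, [fix_typo(l) for l in hard_old])
```

`soft_cost` is the length of the whole paragraph, `hard_cost` the length of one short line. Note that this only holds if the fix is made in place; rewrapping the paragraph afterwards can change every following line.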
Wrapping is as simple as: `fold -s -w 80 input.txt`
Unwrapping usually turns out to be harder according to my experiences. [1]
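A Python sketch of both directions, using the standard `textwrap` module rather than `fold` (the two do not break lines in exactly the same places). The function names are made up, and the unwrap side is intentionally naive.

```python
import textwrap


def hard_wrap(text: str, width: int = 80) -> str:
    """Wrap each blank-line-separated paragraph to `width` columns."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


def unwrap(text: str) -> str:
    """Naive inverse: join every line of a paragraph with single spaces.

    This assumes every line break inside a paragraph was inserted by a
    wrapper, which is exactly what cannot be known in general.
    """
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

The round trip works for plain prose, but `unwrap` will happily flatten bullet lists, indented code, tables and signature blocks, because a hard-wrapped file no longer records which line breaks were meaningful.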
> You also can't grep for things "at the beginning of the line", which is often an important indicator. When I did a lot of plain C programming, I would put function names at the start of a line, below their return type to make it easy to grep for a function definition, rather than just uses.
I see what you mean. But I don’t think your approach conflicts with my recommendation for soft-wrapping. You can still soft-wrap regular text files while choosing to separate certain lines of code for clarity. What you’re doing might not even be considered "hard-wrapping" in the typical sense—it's not like you're breaking a 240-character line into multiple lines. You're simply formatting the definition in a way that suits your style, and it's perfectly ok!
For the last one, you can simply use `git diff --word-diff`. Also, platforms like GitHub already highlight word-based diffs, so it usually is very easy to spot the changes.
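To show the idea behind a word-based diff, here is a small Python sketch built on `difflib.SequenceMatcher`. It is not how git implements `--word-diff`; it only borrows the `[-removed-]` / `{+added+}` markers for readability, and it splits on whitespace only.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Render a word-level diff of two strings on a single line."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 < i2:   # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:   # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)
```

For example, `word_diff("the quick brwon fox", "the quick brown fox")` gives `the quick [-brwon-] {+brown+} fox`, however long the surrounding line is.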
I do a lot of Go programming these days, and there's a conventional format for code that ends up with a lot of hard wrapped lines, so my C example is just that, an example.
Maybe Markdown would be a better example. When I edit markdown, I move around phrases, clauses and sentences. It's certainly possible to do this with a gigantic soft wrapped chunk of text, but it's much easier with one clause or even phrase per hard wrapped (at 74 characters or less) line. Grepability and diffability and even running text through sed or awk are easier. You're not relying on text coloring. Editing with vim is easier, it has commands to move the cursor to next word, previous paragraph etc etc.
This is one of those things like tabs or spaces and byte order marks. We're unlikely to convince each other.
So I think that's a good argument for doing semantic wrapping of code and text (I guess semantic wrapping for code is just not writing everything in one long line separated by semicolons), but once you've put in semantic line breaks, you still need to decide how to handle text that spans multiple lines.
> But if that sentence runs over e.g. 80 characters,
> you still need to decide
> whether you're going to hard wrap or soft wrap that sentence.
No I don't. Semantic wrapping all the way. > This is a sentence that includes the word "Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphiokarabomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon" in it.
> How should it be wrapped semantically?
This is a pathological case to demonstrate how semantic wrapping does not by itself solve the "hard vs soft" wrapping question. If the answer is that the word should remain as a single word, then you are using soft wraps (or no wraps at all). If the answer is that the word should be split into 80 character chunks, then you're using hard wraps.
I have no idea what the semantics of that word are, which is information that is required in order to properly semantically wrap it. (Inherently, since conveying such semantics is one of the major points of semantic wrapping.)
However, you included embedded control characters (C2 AD aka 'SOFT HYPHEN'; below replaced with '-') that encode less semantic information than is necessary for proper semantic wrapping, but not none:
Lopado-temacho-selacho-galeo-kranio-leipsano-drim-hypo-trimmato-silphio-karabo-melito-katakechy-meno-kichl-epi-kossypho-phatto-perister-alektryon-opte-kephallio-kigklo-peleio-lagoio-siraio-baphe-tragano-pterygon.
Web browsers use that information to do poor-quality semantic wrapping automatically - actual hard or soft[0] wrapping would produce something like:
Lopadotemachoselachogaleokranioleipsanod-
rimhypotrimmatosilphiokarabomelitokatake-
chymenokichlepikossyphophattoperisterale-
ktryonoptekephalliokigklopeleiolagoiosir-
aiobaphetraganopterygon.
Which looks like the following from a partly-semantically-aware perspective:
Lopado-temacho-selacho-galeo-kranio-leipsano-d[BREAK]rim-hypo-trimmato-silphio-karabo-melito-katake[BREAK]chy-meno-kichl-epi-kossypho-phatto-perister-ale[BREAK]ktryon-opte-kephallio-kigklo-peleio-lagoio-sir[BREAK]aio-baphe-tragano-pterygon.
The fact that you included soft hyphens rather concedes the point that hard and soft[0] wrapping is incorrect[1].
0: Or rather, non-semantic, which is what we're actually arguing over. Technically, semantic wrapping is a subset of hard wrapping, but it's a specific subset that isn't what is expressed by just saying "hard wrapping". Kind of like how birds aren't what anyone means when they just say "dinosaurs".
1: Granted, to be fair, a lot of the time we just don't care. But (contra your original comment) we never need to resort to non-semantic wrapping; we just sometimes (often) decide to be lazy because it doesn't matter.
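For reference, the soft hyphen discussed above is U+00AD, which UTF-8 encodes as the bytes C2 AD. A minimal Python sketch of breaking a word only at those marked positions; the function name and the greedy strategy are assumptions for the example, not a description of what any browser actually does.

```python
SHY = "\u00ad"                           # SOFT HYPHEN
SHY_UTF8 = SHY.encode("utf-8")           # b'\xc2\xad'


def wrap_at_soft_hyphens(word: str, width: int) -> list[str]:
    """Greedily break `word` at soft hyphens so lines fit in `width`.

    A visible '-' is added at each break. A single piece longer than
    `width` is left overlong rather than split at an unmarked position.
    """
    lines, current = [], ""
    for piece in word.split(SHY):
        # The +1 leaves room for the visible hyphen at the break.
        if current and len(current) + len(piece) + 1 > width:
            lines.append(current + "-")
            current = piece
        else:
            current += piece
    lines.append(current)
    return lines


EXAMPLE = SHY.join(["super", "cali", "fragil", "istic", "expi", "ali", "docious"])
EXAMPLE_LINES = wrap_at_soft_hyphens(EXAMPLE, width=12)
```

With a width of 12 this yields `supercali-`, `fragilistic-`, `expiali-`, `docious`: every break lands on a position the author marked, never in the middle of a piece.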
Instead, I would prefer a soft semantic wrap: if a single semantic unit (be that a word, a clause, or whatever else) extends beyond, say, 80 characters, we keep it on the same line and let the editor/file viewer handle wrapping. This means that we maintain grepability over words and semantically-connected phrases, and we maintain diffability by avoiding the hard-wrap cascade. To me, this is a much more useful version of semantic wrapping, because it only wraps when there is a semantic clause, and not on any arbitrary semantic break.
My goal here isn't to convince you that this version is better than your version of semantic wrapping, only that wrapping based on semantics is an orthogonal concept to hard and soft wrapping, and that even if we choose to take a semantic wrapping approach, we still need to decide what to do with particularly long lines.
(Although I will add to this: I had a colleague who was a deep fan of semantic wrapping, and I just never really got it. I used it for a couple of years, but I've never run into issues with simply soft-wrapping everything. When inserting new clauses or changing text in the middle of a line, every diff tool that I've used has been able to accurately identify which portion of a given paragraph has changed and highlight it. Meanwhile, as a writer and reader, I need to put more effort into reading prose that is written in an odd, stylised format that is very different from the intended paragraph structure. I can see the argument that I've accepted semantic line breaks in code or configuration files, so I should be able to handle it in markdown, but I just find it harder to read and more irritating to write. But assuming someone does want to use semantic line breaks, I still believe that that's an orthogonal choice to deciding between hard and soft wrapping.)
So would I, but...
> if a single semantic unit (be that a word, a clause, or whatever else) extends beyond, say, 80 characters, we keep it on the same line and let the editor/file viewer handle wrapping.
...the editor can't do that because it doesn't understand the semantics.
> that wrapping based on semantics is an orthogonal concept to hard and soft wrapping
Yes, that's why I've been saying "hard and/or soft [but in either case nonsemantic] wrapping".
> > > With semantic wrapping you put each sentence (or similar) on a new line [...] But if that sentence runs over e.g. 80 characters, [then...]
... You don't need to fall back on non-semantic wrapping, you can just keep breaking it up into smaller and smaller semantically-meaningful pieces.
(You have to do that 'hard'-ly because the editor doesn't understand the semantics, but that's not "decid[ing] whether you're going to hard wrap or soft wrap", it's being forced to hard wrap as an implementation detail because that's what results in correct wrapping.)
It might not be worth the effort to do that, but you're never forced not to (given not-pathologically-short line length limits like 20 characters).
To you, I get the impression that semantic wraps are about ensuring that every wrap/line break happens at a semantically valid place, where semantically valid could be a semantically valid clause, but also a semantically valid intra-word line break.
In that sense, I can see how your strategy would produce the same effects as hard wrapping, albeit with different choices about where to put the wraps. But I think then, like I said, you end up running into the same difficulties that you do with conventional hard wrapping, at least in pathological cases.
Yes, with the obvious possible exception of trivial/degenerate cases like "i++; j--;" in C or "This is a cat. That is a dog." in English.
> and every line break represents a semantic clause or sentence gap.
Specifically, it represents a maximally coarse semantic gap, drilling as shallowly down into subclauses as possible/practical.
> wrap/line break [can happen at ...] also a semantically valid intra-word line break.
Preferably only if that word would already be alone on its overly-long line. Eg:
# bad, breaks subordinate clause before superordinate
That sounds supercalifragilistic-
expialidocious.
# semantically valid, but ugly (a pathological case)
That sounds
supercalifragilisticexpialidocious.
# vertically larger, but probably fine
# (unless you're feeling incunabulum-y[0])
That sounds
supercalifragilistic-
expialidocious.
> you end up running into the same difficulties that you do with conventional hard wrapping, at least in pathological cases.
I've yet to see any evidence that really pathological cases exist. (As opposed to "I'm lazy and can't be arsed" cases, which I'm fairly explicitly not disputing.)
Poor phrasing; 20 characters was meant as an example of a limit that is pathologically short.
There are better tools for that, ones that show a word-based diff instead of a huge block. There are no such tools that can convert your hard-coded line breaks back.
And the support for soft-wrapping in tools varies: it may be completely unavailable, or just turned off by default, and generally unused in such a case.
I think reflowable text enters the area of markup languages, rather than plain text.
But anyway, don't you have a code editor with a sidebar with function names, that you can click on to go to the definitions? Sounds like choosing to navigate via grep is the nature of the problem with grepability. And other search tools that aren't regex based can search for multi-line text. This isn't about plain text, it's about Vim. It's like saying "this farmer's field should be constructed differently because it isn't skateable".
Why can't you, is there no way to make grep work with regular expressions??
This is actually one of HTML's most underrated features - there is no distinction between hard and soft wrapping. Any whitespace, of any form and quantity, between any two words is just converted to a single space in the rendered output.
Thus the developer, in a code editor, is free to hard wrap and indent the text in whatever way makes the most visual sense. Meanwhile in the rendered output the actual wrapping that occurs (if any) is controlled by the stylesheet.
I wish more programming languages had multiline string syntax that could do this (automatically remove all newlines and indentation). It turns out to be quite useful in a variety of domains.
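In languages without such a syntax, a small helper gets most of the way there. A Python sketch, assuming the same set of whitespace characters that HTML collapses (space, tab, LF, CR, form feed); the helper name and the message text are made up for the example.

```python
import re

# The ASCII whitespace characters that HTML collapses between words.
_HTML_WHITESPACE = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim the ends."""
    return _HTML_WHITESPACE.sub(" ", text).strip(" ")


MESSAGE = collapse_whitespace("""
    This message is hard-wrapped and indented in the source,
    but reaches the user as a single line that the
    display layer is free to wrap.
""")
```

`MESSAGE` ends up as one line with single spaces. When the spacing has to survive verbatim, this helper is the wrong tool; the standard library's `textwrap.dedent` only removes the common leading indentation and keeps the line breaks.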
But then you need some way to provide the exact indentation/spacing in some cases. And the easiest is to provide them verbatim.
Can be used on any element of course, not just <pre>.
https://searchfox.org/mozilla-central/source/layout/style/re...
The Firefox default style sets a fixed width font and sets a small margin. What's "a lot more"?
In these types of communities there is no formal markup. So what is code and what is text? You can’t tell. Some might use “code fence”. Some might use four-space indents. Some might just dump code in between prose. And when you comment on a patch you comment directly on the diff.[1]
You can’t just let the email reader go to town on the text. That’s fine for prose but annoying for code where every line break is either intentional or machine formatted.
The author mentions the downside of browsing on a mobile device. Yeah I sometimes do that. But the primary mode for this kind of browsing might just be on a laptop/desktop. Certainly if you plan on doing some coding. (not just browsing the email archive for discussions that happened eight years ago... not that I would ever do that)
[1] Maybe diffs are easy to parse out of a message since each line starts with `+`, `-` or a space. After you have peeled away the quoting.
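A very rough Python sketch of that footnote's idea: peel off the `>` quoting, then guess from the first character whether a line belongs to a diff. The function names and category labels are invented for the example, and the heuristic is easy to fool (a prose bullet starting with `-` looks like a removed line).

```python
import re

# One or more levels of "> " style quoting at the start of a line.
_QUOTE_PREFIX = re.compile(r"^(?:> ?)+")


def strip_quoting(line: str) -> str:
    """Remove email quote markers, keeping the line's own leading space."""
    return _QUOTE_PREFIX.sub("", line)


def classify(line: str) -> str:
    """Guess whether a (possibly quoted) line is part of a unified diff."""
    body = strip_quoting(line)
    if body.startswith(("+++", "---", "@@", "diff ")):
        return "diff-header"
    if body.startswith("+"):
        return "added"
    if body.startswith("-"):
        return "removed"
    if body.startswith(" "):
        return "context"
    return "text"
```

Even when this guesses right, it only identifies diff lines; it says nothing about code that was pasted between paragraphs without any marker, which is the harder half of the problem.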
Sure, now. But, there was a time when I was a young man in college (circa 1997) where professors and the industry would push Tabs as a standard. Shortly after, the tides changed and we were all using spaces.
> Stop hard-wrapping and just use soft-wrapping
Who cares about grep? I mean, aside from the OP and probably many on here. Wrapping is a task that should be left up to the viewing device/software. It can be made to be responsive, which hard-wrapping cannot be.
> newline
This really should be a solved issue by now. Both as users and by software.
Very interesting. Thanks for sharing this information! What do you think might have caused this though?
> Who cares about grep?
I do care. I find it much easier to work with a codebase that has logs and error messages that can be easily searched. Similarly, working on a blog with searchable text makes more sense to me. Before switching to soft-wrapping, I used hard-wrapping, and sometimes I would notice a typo or an issue in one of the essays. When I tried to quickly search for a nearby word, it wouldn’t find it because the text had been hard-wrapped. I think it also makes it far easier for outsiders to navigate a repo which they are not familiar with.
About the newline, I agree.
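The failed-search problem described above can be worked around by matching any run of whitespace between words instead of a literal space. A minimal Python sketch; `phrase_pattern` and the sample text are made up for the example.

```python
import re


def phrase_pattern(phrase: str) -> re.Pattern:
    """Build a regex that matches `phrase` across spaces, tabs and newlines."""
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))


HARD_WRAPPED = "The quick brown\nfox jumps over\nthe lazy dog."

plain_hit = "brown fox" in HARD_WRAPPED                            # False
regex_hit = phrase_pattern("brown fox").search(HARD_WRAPPED) is not None
```

This works when the whole file is searched as one string. A line-oriented tool still sees "brown" and "fox" on different lines, so the workaround does not carry over directly to a plain `grep` invocation.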
The rise in web-browser-based code editors, where tab moves to the next field on the form instead of inserting a literal \t character.
- If it’s code, you should be using an automatic code formatter and that’s it.
- If you write prose, sure, soft wrap.
If you grep, \s doesn’t care about space vs tabs.
Sadly, elastic tabs never caught on as the default [0].
Maybe we would need something like a "semantic alignment" marker instead of using spaces for aligning things. Like beginning of function name, beginning of function argument, etc.
unless you want better alignment similar to the example of vertical alignment, which automatic code formatters simply don't allow for since they're much more simplistic.
> If you grep, \s doesn’t care about space vs tabs.
so now instead of regular easy-to-type space you have to do \s to search for a phrase?
This rules out special characters and binary file formats, of course.
Prose can also have strict structure.
The other way leads to a bunch of random indentation levels all over the file which has always looked ugly to me.
I learned it the hard way with my static site generator and the pages of my website that use Umlauts. It introduced subtle problems where Syncthing would replace one standard with another, and nginx would suddenly 404 on URLs that looked fine to me.
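Assuming the two "standards" here are the Unicode normalization forms NFC and NFD (the comment does not say so explicitly), this Python sketch shows how two file names can look identical and still be different strings with different URLs. The file name is invented for the example.

```python
import unicodedata
from urllib.parse import quote

NAME = "M\u00fcller.html"                       # displays as "Müller.html"

nfc = unicodedata.normalize("NFC", NAME)        # 'ü' as one code point, U+00FC
nfd = unicodedata.normalize("NFD", NAME)        # 'u' followed by U+0308

same_string = nfc == nfd                        # False, although both render alike
lengths = (len(nfc), len(nfd))                  # (11, 12)

url_nfc = quote(nfc)                            # 'M%C3%BCller.html'
url_nfd = quote(nfd)                            # 'Mu%CC%88ller.html'
```

If the link was generated from one form and the file on disk is stored under the other, a server that compares bytes will not find it, which would match the "404 on URLs that looked fine" symptom. Normalizing both sides to one form before comparing is the usual fix.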
Here's my user test: https://news.pub/?try=https://www.youtube.com/embed/rx7nv6R5...
You never know how paths might cross.
I think the information you shared about the tabs is worth mentioning. I'll reference your video and the tabs info you provided in the addendum.
Or if you used elastic tabstops, the purpose of using tabs would be that this alignment happens automatically on edits, instead of you having to adjust the number of spaces manually.
Then you rewrap and commit that, with a comment "rewrap" so the reader knows there is no material change.