I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454
I know it's a hassle for a platform to tell good rants from bad ones, and I decry SO for pushing too hard against these. I truly believe that our industry would benefit from more drunken technical rants.
I don't suggest writing generic HTML parsers that work with any site, but for custom crawlers they work great.
Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.
> For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days.
At the risk of having missed the latest meme or social media drama: does anyone know what this "some reason or other" is?
Edit: Question answered.
But not everybody has every single global development / news event IVed into their veins. Many of us just don’t keep updated on global news such that we may not be aware of an event that happened in the last 3 days.
Important news tends to get to me eventually. And there is usually nothing I can do about something personally anyway (at least within a short time horizon), so there is really very little value in trying to stay informed of the absolute latest developments. The signal to noise ratio is far too low, and it also induces a bunch of unnecessary anxiety and stress.
So yes, believe it or not, very many people are unaware of this.
Why do mail servers care about how long a line is? Why don't they just let the client reading the mail worry about wrapping the lines?
The server needs to parse the message headers, so it can't be an opaque blob. If the client uses IMAP, the server needs to fully parse the message. The only alternative is POP3, where the client downloads all messages as blobs and you can only read your email from one location, which made sense in the year 2000 but not now when everyone has several devices.
POP3 is line-based too, anyway. Maybe you can rsync your maildir?
Given a mechanism for soft line breaks, breaking lines below 80 characters would increase compatibility with older mail software and be more convenient when listing the raw email in a terminal.
This is also why MIME Base64 typically inserts line breaks after 76 characters.
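A minimal sketch of both wrapping behaviours, using only Python's standard library (the payloads and variable names are made up for illustration). `base64.encodebytes` emits MIME-style output with a newline after every 76 characters, and `quopri.encodestring` inserts soft line breaks (a trailing `=`) to keep quoted-printable lines within the same limit:

```python
import base64
import quopri

# Arbitrary binary payload, purely for illustration.
payload = bytes(range(256)) * 2

# MIME-style Base64: output is wrapped after 76 characters per line.
b64 = base64.encodebytes(payload)
longest_b64 = max(len(line) for line in b64.splitlines())

# One long ASCII line with no line breaks of its own.
text = " ".join(["word"] * 40).encode("ascii")

# Quoted-Printable: long lines are split with soft line breaks ("=" + newline).
qp = quopri.encodestring(text)
longest_qp = max(len(line) for line in qp.split(b"\n"))

print("longest Base64 line:", longest_b64)
print("longest QP line:", longest_qp)

# Decoding with the matching decoder restores the original bytes exactly.
assert base64.decodebytes(b64) == payload
assert quopri.decodestring(qp) == text
```

In both cases the line breaks belong to the encoding, not to the content, so a correct decoder removes them again.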
https://en.wikipedia.org/wiki/BITNET
BITNET connected mainframes, had gateways to the Unix world and was still active in the 90s. And limited line lengths … some may remember SYSIN DD DATA … oh my goodness …
https://www.ibm.com/docs/en/zos/2.1.0?topic=execution-systsi...
telnet smtp.mailserver.com 25
HELO
MAIL FROM: me@foo.com
RCPT TO: you@bar.com
DATA
blah blah blah
how's it going?
talk to you later!
.
QUIT
openssl s_client -connect smtp.mailserver.com:smtps -crlf
220 smtp.mailserver.com ESMTP Postfix (Debian/GNU)
EHLO example.com
250-smtp.mailserver.com
250-PIPELINING
250-SIZE 10240000
250-VRFY
250-ETRN
250-AUTH PLAIN LOGIN
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-DSN
250-SMTPUTF8
250 CHUNKING
MAIL FROM:me@example.com
250 2.1.0 Ok
RCPT TO:postmaster
250 2.1.5 Ok
DATA
354 End data with <CR><LF>.<CR><LF>
Hi
.
250 2.0.0 Ok: queued as BADA579CCB
QUIT
221 2.0.0 Bye
Edit: wrong.
However what most mail programs show as sender and recipient is neither, they rather show the headers contained in the message.
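To illustrate the difference between the SMTP envelope (`MAIL FROM` / `RCPT TO`) and the headers inside the message, here is a rough sketch using Python's `email` and `smtplib` modules. All addresses and the `deliver` helper are hypothetical, and nothing is sent unless `deliver` is called with a real host:

```python
import smtplib
from email.message import EmailMessage

# Headers inside the message: this is what most mail clients display.
msg = EmailMessage()
msg["From"] = "Alice <alice@example.com>"
msg["To"] = "Bob <bob@example.org>"
msg["Subject"] = "Hello"
msg.set_content("Hi\n")

# Envelope addresses: what the SMTP dialogue (MAIL FROM / RCPT TO) uses.
# They do not have to match the headers above.
envelope_from = "bounces@example.net"
envelope_to = ["archive@example.org"]


def deliver(host: str) -> None:
    """Hypothetical helper: hand the message to an SMTP server at `host`."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg, from_addr=envelope_from, to_addrs=envelope_to)


# The envelope sender appears nowhere in the message text itself.
print(envelope_from in msg.as_string())
```

The server routes on the envelope; the recipient's mail program usually shows only the `From:` and `To:` headers.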
I suspect this is relevant because Quoted Printable was only a useful encoding for MIME types like text and HTML (the human-readable email body), not binary (e.g. attachments, images, videos). Mail servers (if they want) can effectively treat the binary types as an opaque blob, while the text types can be read for more efficient transfer of message listings to the client.
It never got too popular, but I had users for a few years and I can honestly say MIME was the bane of my life for most of those years.
I think there is a second possible conclusion, which is that the transformation happened historically. Everyone assumes these emails are an exact dump from Gmail, but isn't it possible that Epstein was syncing emails from Gmail to a third party mail server?
Since the Stackoverflow post details the exact situation in 2011, I think we should be open to the idea that we're seeing data collected from a secondary mail server, not Gmail directly.
Do we have anything to discount this?
(If I'm not mistaken, I think you can also see the "=" issue simply by applying the Quoted-Printable encoding twice, not just by mishandling the line-endings, which also makes me think two mail servers. It also explains why the "=" symbol is retained.)
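The double-encoding idea can be checked with Python's `quopri` module. This is only a sketch with made-up sample text, and it says nothing about what actually happened to the emails in question: encoding twice and decoding once leaves the first layer of Quoted-Printable visible, including `=3D` and the trailing `=` of soft line breaks.

```python
import quopri

# Made-up sample: contains a literal "=" and is long enough to need wrapping.
original = b"x=1 and a rather long line " + b"y" * 80

once = quopri.encodestring(original)   # first pass
twice = quopri.encodestring(once)      # second pass re-encodes every "="

# A reader that decodes only once gets the first-pass encoding back,
# complete with "=3D" and soft-break "=" characters at line ends.
decoded_once = quopri.decodestring(twice)
print(decoded_once.decode("ascii"))

# The smallest case: a single "=" survives one decode as "=3D".
small = quopri.decodestring(quopri.encodestring(quopri.encodestring(b"a=b")))
print(small)
```

Decoding a second time recovers the original text, which is why the artifacts only show up when one decoding step is missing.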
I wonder why even have a max line length limit in the first place? I.e. is this for a technical reason or just display related?
I wonder if the person who had the idea of virtualizing the typewriter carriage knew how much trouble they would cause over time.
Edit: yes I think that's most likely what it is (and it's SHOULD 78ch; MUST 998ch) - I was forgetting that it also specifies the CRLF usage, it's not (necessarily) related to Windows at all here as described in TFA.
Here it is in my 'notmuch-more' email lib: https://github.com/OJFord/amail/blob/8904c91de6dfb5cba2b279f...
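For reference, the two limits mentioned above (SHOULD 78, MUST 998, not counting the CRLF) could be checked with something like the following. This is a standalone sketch with hypothetical names, not code from the linked library:

```python
SHOULD_LIMIT = 78   # recommended maximum line length, excluding CRLF
MUST_LIMIT = 998    # hard maximum line length, excluding CRLF


def classify_line(line: bytes) -> str:
    """Classify one raw message line against the two length limits."""
    length = len(line.rstrip(b"\r\n"))  # the line terminator is not counted
    if length > MUST_LIMIT:
        return "violates MUST"
    if length > SHOULD_LIMIT:
        return "exceeds SHOULD"
    return "ok"


print(classify_line(b"short line\r\n"))
print(classify_line(b"x" * 100 + b"\r\n"))
print(classify_line(b"x" * 1000 + b"\r\n"))
```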
The article doesn't claim that it's Windows related. The article is very clear in explaining that the spec requires =CRLF (3 characters), then mentions (in passing) that CRLF is the typical line ending on Windows, then speculates that someone replaced the two characters CRLF with a one character new line, as on Unix or other OSs.
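Here is one way such a line-ending mismatch could leave stray `=` signs behind. The decoder below is a deliberately naive, hypothetical one that only recognises the three-character soft break `=CRLF`; it is a sketch of the failure mode, not a reconstruction of whatever tool was actually used.

```python
import re


def naive_qp_decode(text: str) -> str:
    """Hypothetical strict decoder: only '=CRLF' counts as a soft line break."""
    text = text.replace("=\r\n", "")
    # Decode =XX hex escapes such as =3D ("=") or =20 (space).
    return re.sub(r"=([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), text)


# Made-up message as it would appear on the wire, with CRLF line endings.
wire = "This line is soft-wrap=\r\nped, and 1+1=3D2.\r\n"
clean = naive_qp_decode(wire)

# Same message after something converted CRLF to a bare LF first.
mangled = naive_qp_decode(wire.replace("\r\n", "\n"))

print(repr(clean))
print(repr(mangled))
```

With CRLF intact the soft break disappears; after the conversion the decoder no longer recognises it, so the `=` and the line break both stay in the text.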
It's just so hacky, I can't believe it's a real-life solution.
Consider converting the original text (maintaining the author’s original line wrapping and indentation) to base64. Has anything been “inserted” into the text? I would suggest not. It has been encoded.
Now consider an encoding that leaves most of the text readable, translates some things based on a line length limit, and some other things based on transport limitations (e.g. passing through 7-bit systems.) As long as one follows the correct decoding rules, the original will remain intact - nothing “inserted.” The problem is someone just knowledgeable enough to be aware that email is human readable but not aware of the proper decoding has attempted to “clean up” the email for sharing.
Infinite line length = infinite buffer. Even worse, QP is 7-bit (because SMTP started out ASCII only), so bytes above 127 get encoded as three bytes each (an equals sign, then two hex digits), so 500 bytes of non-ASCII UTF-8 text become 1500 bytes.
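The threefold expansion is easy to confirm with Python's `quopri` module. A small sketch with an arbitrary sample string: 250 copies of "é" are 500 bytes in UTF-8, and each of those bytes becomes a three-character `=XX` escape (soft line breaks come on top of that).

```python
import quopri

# 250 two-byte characters: 500 bytes of UTF-8, none of them ASCII.
line = "\u00e9" * 250
utf8 = line.encode("utf-8")

encoded = quopri.encodestring(utf8)

# Strip the soft line breaks to count only the escaped payload.
without_soft_breaks = encoded.replace(b"=\n", b"")

print(len(utf8), "bytes of UTF-8")
print(len(without_soft_breaks), "bytes of Quoted-Printable, soft breaks excluded")
print(len(encoded), "bytes including soft breaks")
```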
It all made sense at the time. Not so much these days when 7-bit pipes only exist because they always have.
But I agree with the sibling comment: it makes more sense when it's called "encoding" instead of "inserting chars into the original stream".
Digital communication is based on the postmen reading, transcribing and copying your letters. There is a reason why digital communication is treated differently than letters by the law and why the legally mandated secrecy for letters doesn't apply to emails.
Did the site get the HN kiss of death?
I, too, was reading about the new Epstein files, wondering what text artifact was causing things to look like that.
https://nitter.net/AFpost/status/2017415163763429779?s=201
Something clearly went wrong in the process.
I'm glad to know the real reason!
cat title | sed 's/anyway/in email/'
would save a click for those already familiar with =20 etc.
On a side note: There are actually products marketed as kosher bacon (it's usually beef or turkey). And secular Jews frequently make jokes like this about our kosher bros who aren't allowed to eat the real stuff for some dumb reason like it has too many toes.
We’ve become so accustomed to modern libraries handling encoding transparently that when raw data surfaces (like in these dumps), we often lack the 'Digital Archeology' skills to recognize basic Quoted-Printable.
These artifacts (=20, =3D) are effectively fossils of the transport layer. It’s a stark reminder that underneath our modern AI/React/JSON world, the internet is still largely held together by 7-bit ASCII constraints and protocols from the 1980s.
The writer presumably knows that umlauts and other non-ascii characters are functional in many languages. "rock döts" is poking fun at the trend in a certain tranche of anglophone rock/metal to use them in a purely aesthetic way in band names etc.
Back in those days optical scanners were still used.