I have seen people do ceramics where information was stacked in layers and had to be destroyed to extract it. The ultimate form of shifting media to preserve and read information. I guess that could be done with better resolution with 3D-printed zirconia (blobs 0.1 mm on a side), so about 1 Mb/cm³.
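A quick sanity check on that figure, assuming the blobs are 0.1 mm on a side and carry one bit each:

  # Napkin math: bits per cm^3 at a 0.1 mm voxel pitch, one bit per voxel.
  voxel_mm = 0.1                             # assumed voxel edge length
  voxels_per_mm3 = (1 / voxel_mm) ** 3       # 1000 voxels per mm^3
  voxels_per_cm3 = voxels_per_mm3 * 1000     # 1 cm^3 = 1000 mm^3
  print(f"{voxels_per_cm3:,.0f} bits/cm^3")  # 1,000,000, i.e. ~1 Mb/cm^3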
Edit: this idea of cold storage is from Footfall by Niven and Pournelle, where information was stored on monoliths whose layers could be incrementally extracted with tools documented on the layers above, i.e. start at 0.1 bit per m² and go down, with the hand-wavy handling of practical problems typical of science fiction.
[1] https://www.bookandsword.com/2016/10/29/the-information-dens...
[^1]: Can't find the source right now, so take this with a grain of salt.
>The biggest advantage of character-based encodings is that they can be decoded by humans (as opposed to dot-based encodings), which means that you don’t need a camera or a scanner to recover the data.
This is an interesting point. In our post-apocalyptic future, scholars will be using their quills to translate archives of these (in my imagination, anyway). Of course, they would have to translate into binary and then into human characters.
I can imagine they will be sad they cannot listen to the MP3s.
>Adding color allows one to code more information per dot (3x more with three colors).
Is this right? Wouldn't it be base-3 encoding? Three bits of binary can count to 8. Three trits of base three can count to 27. Color has all sorts of disadvantages, but maybe a much greater payoff (unless I'm mistaken).
I am very skeptical of this idea that people will be able to write but unable to produce useful digital computers. Computers are a mathematical discovery, not an electronic invention. Electronics makes them a thousand times faster, but a computer made out of wood, flax threads, and bent copper wire would still be hundreds of times faster than a person at tabulating logarithms, balancing ledgers, calculating ballistic trajectories, casting horoscopes, encrypting messages, forecasting finances, calculating architectural measurements, or calculating compound interest. So I think automatic computation as such is likely to survive if any human intellectual tradition does.
I agree. When I first saw the post and the mention of humans in the reading end of the loop, I thought "maybe there is a scifi story here". Hard to imagine a scenario that leaves humans but not many artifacts except caches of paper (or other "printed" media). Maybe a remote tribe of uncontacted people (or another species altogether) inherits the Earth after a modern-world apocalypse kills off everyone in the technologically more advanced world.
A civilization starting from scratch would still need to develop a fair bit of math and tech/science sophistication before understanding and starting to use the artifacts left behind. In particular, optical/color scanners for paper would have been difficult to build before the 20th century.
Imagine tomes of programming lore, dutifully transcribed by rooms of silent scribes, acolytes carrying freshly finished pages to and fro, each page beautifully illuminated with pictures of the binary saints, to ward off Beelzebug.
In this case they're not directly using the color to store information; they just have three differently colored QR codes overlaid on top of each other. With that method you can use a filter to separate them back out, and you've got three separate QR codes' worth of data in one place. The way they're added ends up using more than just three colors in that example.
If you were truly to use colored dots to store binary information, without worrying about a standard like QR, I think you'd be going from base-2 (white and black) to base-3 (red, blue, green), or more likely base-4 (white, red, blue, green), or even base-8 (if you were willing to add multiple colors on top of each other), in which case, yeah, you'd have way more than just 3x the data density.
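A minimal sketch of the channel-separation idea, assuming the three codes were dropped into the red, green, and blue channels of a scan (the file name here is hypothetical), using Pillow:

  # Recover three QR codes overlaid in the R, G, and B channels of one image.
  from PIL import Image

  img = Image.open("overlay.png").convert("RGB")  # hypothetical scan
  for name, channel in zip("RGB", img.split()):
      # Threshold each channel back to a plain black-and-white code.
      channel.point(lambda v: 255 if v > 128 else 0).save(f"qr_{name}.png")

(With subtractive inks rather than additive light the channels come out inverted, but the idea is the same.)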
That's only true if you can print and read colors at a higher resolution / without destroying information at 3x the density with color; I'm not sure that's generally true.
>If you were truly to use colored dots to store binary information, without worrying about a standard like QR, I think you'd be going from base-2 (white and black) to base-3 (red, blue, green), or more likely base-4 (white, red, blue, green), or even base-8 (if you were willing to add multiple colors on top of each other), in which case, yeah, you'd have way more than just 3x the data density.
Base 8 is exactly 3x the data density (log 8 / log 2 = 3).
2 dots at 2 possibilities each gives 4 combinations (2^2).
They only diverge from there. Or am I doing my math wrong?
log(25)/log(4) is about 2.3. Among other things, this logarithmic definition of capacity has the nice property that two disks/pages/drives/bits together contain the sum of the capacities instead of their product.
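To put the arithmetic in one place: bits per dot is log2 of the number of distinguishable states, so:

  from math import log2

  # Bits per printed dot for k distinguishable states; base-2 is the baseline.
  for k in (2, 3, 4, 5, 8):
      print(f"base {k}: {log2(k):.2f} bits/dot")
  # -> 1.00, 1.58, 2.00, 2.32 (= log(25)/log(4)), 3.00 (exactly 3x base-2)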
This has shown itself to be an issue for including data, such as spreadsheets. Most colleagues just print Excel files to a PDF that gets appended; while that complies with the regulation, it's basically unusable as-is.
For this reason, paper is at best useful as a bootstrapping mechanism, which would allow readers to construct a mechanism to read more densely encoded data. My best guess is that the main storage of information in this case would likely be microfilm, which should be at least 100x denser than the ideal paper data storage. Higher density allows for using less dense encodings to aid readers. And as far as I know, microfilm is no harder to preserve than paper.
Or just go with metal https://rosettaproject.org/
Or try to create a culture for humans and store information in that.
A fiber laser in the 100 W range would do it, maybe $10k?
You could do photochemical etching, but it would be more fuss and wouldn't last as long as laser engraving.
Probably looking at on the order of 1 gig per 1000 kg if using 1 mm 316 plate (napkin math only, naive estimate). Interesting to explore.
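A sketch of that napkin math, with the engraving pitch as the big unknown (0.1 mm dots, one side, no ECC overhead assumed here):

  # Engraved 316 stainless plate as cold storage: naive upper bound.
  density_g_cm3 = 8.0    # roughly 316 stainless
  mass_kg = 1000
  thickness_cm = 0.1     # 1 mm plate
  dot_pitch_mm = 0.1     # assumed pitch, one bit per dot

  volume_cm3 = mass_kg * 1000 / density_g_cm3
  area_mm2 = volume_cm3 / thickness_cm * 100    # plate area, cm^2 -> mm^2
  bits = area_mm2 / dot_pitch_mm ** 2
  print(f"~{bits / 8 / 1e9:.1f} GB per tonne")  # ~1.6 GB, same order as above

Fiducials, kerf, and error correction would all pull the real number down.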
Not that I'm aware of. A DVD writer laser is maybe 200 mW, so it's not going to be able to engrave most materials, or will do it VERY slowly at best. The spot size is ridiculously small, though (this is good), so they are still interesting.
Most people interested in light wood or thin plastic applications have moved on to the small 5-20 W diode laser form factor; these are available for a few hundred dollars if you aren't too worried about safety (e.g. no kids in the house). Something with a proper enclosure, interlocks, and air handling costs more, but still sub-$1k. The spot size is much bigger than a DVD laser's, though; you can't get anything like the same resolution.
Modding a DVD laser has much higher hack value but it seems to have gone out of style as hobby lasers became widely available as a product.
Re: materials, if you are not on the "happy path" (a material supported by the manufacturer or specifically designed for lasers), you have to get samples and test.
There are a few different interactions between laser spot size, wavelength, power, passes, etc. and the material, which means different people (with different systems) tend to get different results. The variability limits the "shareability" of results; probably the biggest sources of material info / laser settings are the forums of the laser manufacturers, because it makes the most sense to share settings with other users of the same system.
As you noted, glue and nonuniformity are a big thing; most materials aren't designed to be burned / vaporised. For glass specifically, I think the most practical way would be a CO2 laser, which is different again.
We have paper books from 500 years ago. Microfiche is already deteriorating.
If you keep paper dry and flat, and use pH-neutral inks and paper, it is extremely stable.
I'd also expect the plastics to go yellow and opaque over long periods, and recovering the document without damage may be difficult or impossible.
If we just have text files, and maybe vector graphics for simple schematics, that's a lot of info.
You could encode data in monolithic structures this way. They'd last longer than paper and give future generations lots of confusion trying to figure out the meaning.
I find it interesting that, if you were to print 4 sheets double-sided, you would have roughly the same amount of information stored as a 720 kB 5 1/4" floppy disk, and if you cut and folded them, they would take up roughly the same size and weight.
https://youtu.be/mIGotStRCkA?si=toG5xeLMZzjIGTxC
It's more like a long, linear barcode, but still. More often, they put the source code in the magazine and you'd just type it into your machine.
I am not sure why, for character-based encodings, they used a general-purpose font (Inconsolata) rather than one that is specifically made for OCR, and how much that would have helped.
Going further, if you only print a limited alphabet (16, 32, or 39 symbols), why not use a specialized font with only those characters? The final step is to use a bitmap "font" that simply shows different character values as different bit patterns.
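A dependency-free sketch of that final step, writing each byte as its own row of pixels in a plain-text PBM bitmap (the format and layout are arbitrary choices here):

  # A "font" where a character's glyph is simply its bit pattern.
  # One byte per row, one pixel per bit, as a plain PBM image.
  def bytes_to_pbm(data: bytes, path: str) -> None:
      with open(path, "w") as f:
          f.write(f"P1\n8 {len(data)}\n")  # PBM header: width height
          for byte in data:
              f.write(" ".join(str(byte >> (7 - i) & 1) for i in range(8)) + "\n")

  bytes_to_pbm(b"hello, paper", "out.pbm")

Decoding is the same loop in reverse: no OCR needed, just a threshold and a known grid.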
https://www.monperrus.net/martin/perfect-ocr-digital-data
From the linked article:
>The optimal font varies very much on the considered engine. Monospaced fonts (aka fixed-width) such as Inconsolata, are more appropriate in general. ocr-a and ocr-b give really poor results.
I noticed that they liked using lower case letters for bases where that is a choice. I would think that the larger upper case letters would be better for OCR. Using lower case for either OCR-A or OCR-B would be a poor idea in any case: the good OCR properties are only provided for the upper case letters; the lower case letters were mostly provided for completeness.
Also, the author might be training on entire blocks of characters rather than individual characters. That isn't really what you want here unless you are using something like words for your representation. OCR-A and OCR-B were designed for character-by-character OCR.
I saw some work a while ago on storing table data extracted from SQL as an image, and always thought that with good compression and a good printer, you could make paper copies.
I will try to remove the dust from my A4 scanner and try to read that MP3 from the printed medium. It seems a bit insane to store multimedia on paper, but who needs storage without a proven ability to read it back? My printers love to mess with ink (especially the ones with pirate-refilled cartridges), so I do not really believe this is practical at maximum resolution.
https://www.monperrus.net/martin/perfect-ocr-digital-data
(Last section before conclusion.)
IIUC this provided the best overall reliable information density (at 4.2kb / A4 page).
I've seen these barcodes scan accurately off dingy plastic cards using webcams.
The information per symbol is not great (about 1 kb), but the error correction and physical layout work really well.
100 errors in an 876 kB file would be about a 0.0014% error rate. You are going to need another level of ECC on top of that.
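For the record, the arithmetic, treating those as bit errors:

  # 100 bit errors in an 876 kB file.
  errors, size_bytes = 100, 876 * 1024
  print(f"{errors / (size_bytes * 8):.4%}")  # ~0.0014%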
It's probably worth mentioning https://github.com/za3k/qr-backup/ which is tested in practice rather than merely theoretical. It doesn't achieve very high density, though.
The theoretical information capacity of an uncoated 600dpi laser-printed page ought to be close to 600×600 bits per square inch, 23.6×23.6 bits per square millimeter in modern units. This is 33.7 megabits per US letter page or 34.8 megabits per A4 page. The bit error rate of a laser printer is quite low, under 1%, and the margins are maybe another 5% at most. So modest ECC ought to be able to deliver most of that channel capacity in practice. QR codes and OCR apparently don't come close.
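As a sketch, taking the 5% margins and 1% bit error rate above as assumptions and treating the page as a binary symmetric channel:

  # Capacity of a 600 dpi 1-bit page, minus margins and the Shannon ECC cost.
  from math import log2

  dpi, ber, margin = 600, 0.01, 0.05
  h = -ber * log2(ber) - (1 - ber) * log2(1 - ber)  # binary entropy H(ber)
  for name, w_in, h_in in (("US letter", 8.5, 11.0), ("A4", 8.27, 11.69)):
      raw = dpi * dpi * w_in * h_in                 # raw bits on the page
      usable = raw * (1 - margin) * (1 - h)         # margins, then channel capacity
      print(f"{name}: {raw / 1e6:.1f} Mbit raw, ~{usable / 8 / 1e6:.1f} MB usable")

That works out to roughly 3.7 MB per page at the Shannon limit under those assumptions.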
As an exercise, 13 years ago, I designed a proportional 1-bit-deep pixel font for printable ASCII, based on Janne Kujala's work, that averages about 3½×6 pixels. This is about 20 bits per character, so a letter-sized page should hold almost a megabyte of human-readable ASCII text. I generated the King James Bible in it at 600dpi. It comes to about four pages. Printed out in a half-assed way at double size (300dpi) on a 600dpi printer, you can read it pretty easily with a good magnifying glass. I have not yet been able to get an even partly readable printout at full resolution. If someone else tries it, I'm interested in hearing your results.
http://canonical.org/~kragen/bible-columns.png (warning, 93+-megapixel image, 4866×19254)
http://canonical.org/~kragen/bible-columns-320x200.png (small excerpt from the above)
http://canonical.org/~kragen/sw/netbook-misc-devel/6-pixel-1... (the font as a 374×7 image)
http://canonical.org/~kragen/sw/netbook-misc-devel/propfontr... (the image generation program I regret having written in Python because it won't run in current Python)
http://canonical.org/~kragen/sw/netbook-misc-devel/bible-pg1... (test input text, public domain everywhere except the UK)