Show HN: Duper – The Format That's Super (duper.dev.br)

31 points by epiceric 3 months ago | 6 comments

jitl 3 months ago
I think a neat route would be to use this as an authoring plugin in VS Code, like prettier: write Duper (or JSON5, or whatever), and then downlevel it to regular json automatically when pressing cmd-s. You wouldn't get to keep your comments (or they could be transformed to { "//": "comment text" }).
Outside of that, it's tough to compete with JSON in the "human readable unschematized serialization format" market, especially targeting JavaScript:
Use in the browser requires some degree of bundle size increase, since the parser code needs to be loaded before your format can be used. WebAssembly libraries are usually quite large compared to a pure-JS implementation. According to [bundlejs](https://bundlejs.com/?q=%40duper-js%2Fwasm&treeshake=%5B*%5D), @duper-js/wasm weighs in at about 488 kB uncompressed, 159 kB gzip.
Use in any JavaScript runtime means you're competing against the runtime's native `JSON.parse` and `JSON.stringify`. In V8, these are very quick and have runtime-level tricks to go faster, for example see [v8's recent post on making JSON.stringify 2x faster](https://v8.dev/blog/json-stringify) when serializing plain objects with no funny business like .toJSON methods, a replacer, or indent formatting.
Besides those points, my major complaint about JSON is how expensive it is to encode binary data for transmission; in JSON I usually do base64, with your format it's transformed to escape characters that are less efficient than base64, right? \xNN is base16 with 2 extra bytes wasted on the \ and x, and \uNNNN is also base16 with 2 extra bytes wasted on the \ and u. Is there a way you can fit binary with no expensive encode/decode step into the format?
So, for me this seems suitable as a config file format: there you get good benefit from comments, identifiers, easier string authoring. Not sure I need the binary raw string thingy in config files that much, but I guess it doesn't hurt.
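To put rough numbers on the encoding overhead discussed above, here is a minimal sketch comparing base64 against escape-based encoding. It assumes the worst case where every byte is written as a `\xNN` escape; whether Duper escapes every byte or only non-printable ones is an assumption, and the helper names are made up for illustration (they are not part of any Duper API).

```typescript
// Hypothetical helpers for illustration only, not part of any Duper API.

// Base64 emits 4 output characters for every 3 input bytes, rounded up with padding.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Worst case for escape-based encoding: every byte becomes "\xNN" (4 characters).
function hexEscapeLength(byteCount: number): number {
  return 4 * byteCount;
}

// Actually encode bytes as \xNN escapes, to check the length formula against real output.
function hexEscape(bytes: Uint8Array): string {
  const digits = "0123456789abcdef";
  let out = "";
  bytes.forEach((b) => {
    out += "\\x" + digits.charAt(b >> 4) + digits.charAt(b & 0x0f);
  });
  return out;
}

const sample = new Uint8Array([0x00, 0xff, 0x10, 0x80, 0x7f, 0x01]);
const escaped = hexEscape(sample);
console.log(escaped.length, base64Length(sample.length));
```

For 6 bytes that is 24 characters as escapes versus 8 as base64, so in this worst case the escape form is 3x the size of base64 and 4x the raw bytes.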
- notpushkin 3 months ago
  > I think a neat route would be to use this as an authoring plugin in VS Code, like prettier: write Duper (or JSON5, or whatever),
  This actually somewhat works right now. If you pass this JSON5 example through Prettier:
      {
        // comments
        unquoted: 'and you can quote me on that',
        singleQuotes: 'I can use "double quotes" here',
        lineBreaks: "Look, Mom! \
      No \\n's!",
        hexadecimal: 0xdecaf,
        leadingDecimalPoint: .8675309, andTrailing: 8675309.,
        positiveSign: +1,
        trailingComma: 'in objects', andIn: ['arrays',],
        "backwardsCompatible": "with JSON",
      }
  You’ll get:
      {
        // comments
        "unquoted": "and you can quote me on that",
        "singleQuotes": "I can use \"double quotes\" here",
        "lineBreaks": "Look, Mom! \
      No \\n's!",
        "hexadecimal": 0xdecaf,
        "leadingDecimalPoint": 0.8675309,
        "andTrailing": 8675309,
        "positiveSign": +1,
        "trailingComma": "in objects",
        "andIn": ["arrays"],
        "backwardsCompatible": "with JSON"
      }
  Which is still invalid JSON... but it does fix unquoted keys, floats, trailing comma, and single → double quote strings with correct escaping. So if you have “format on save” enabled in your editor, it might just work!
- epiceric 3 months ago
  Duper certainly won't outperform the native JSON implementation (and it likely never will), though I do think benchmarks would be a great addition. Bundle size and binary representation are definitely things I'll keep in mind!
  The config file transpilation to JSON idea is quite interesting. It's pretty similar to how I'm already defining the TextMate grammar used by the website's syntax highlighter, so I'll certainly try to incorporate that into the tooling.
  - jitl 3 months ago
    It may be worth it to pipe Duper into your WASM/native code, and get back plain JSON out, which you then hand off to the runtime's `JSON.parse` with a post-processing step to support any special features needed. Something like this:
        // idea of implementing public duper.parse function to lean on
        // runtime's JSON.parse
        //
        // downlevel to json, eg binary strings become base64 normal json strings
        const { jsonString, enhancements } = duper.duperToJSON(data)
        // let the runtime go fast when decoding
        const rawObject = JSON.parse(jsonString)
        // `enhance` knows the paths to all the binary base64 strings
        // and replaces them with Uint8Arrays
        const decoded = duper.enhance(rawObject, enhancements)
    Here `enhancements` is something very easy / low cost to construct over the FFI bridge, like:
        type Path = Array<string | number>
        type TransformFn = (value: unknown) => unknown
        type Transform = TransformFn | Enhancements
        type Enhancements = Array<[path: Path, transform: Transform]>
    Not sure if this would end up faster, it may allocate more, but it's probably better than unoptimized object/array construction from WASM/native -> runtime. You could also try with a `reviver` argument to JSON.parse but I always find the lack of full path to key somewhat clunky.
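    For concreteness, a rough sketch of what the `enhance` step could look like on the JS side. None of this is an existing Duper API: `enhance`, `applyAt`, and `base64ToBytes` are hypothetical names, the nested-`Enhancements` case is assumed to mean "apply these paths relative to the value found here", and the base64 decoder is hand-rolled only to keep the example self-contained (it does not validate its input).

    ```typescript
    type Path = Array<string | number>;
    type TransformFn = (value: unknown) => unknown;
    type Transform = TransformFn | Enhancements;
    type Enhancements = Array<[path: Path, transform: Transform]>;

    const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical transform: standard base64 string -> Uint8Array (no input validation).
    function base64ToBytes(value: unknown): unknown {
      if (typeof value !== "string") return value; // leave unexpected values alone
      const clean = value.replace(/=+$/, "");
      const out = new Uint8Array(Math.floor((clean.length * 3) / 4));
      let buffer = 0;
      let bits = 0;
      let index = 0;
      for (let i = 0; i < clean.length; i++) {
        buffer = (buffer << 6) | B64.indexOf(clean.charAt(i));
        bits += 6;
        if (bits >= 8) {
          bits -= 8;
          out[index++] = (buffer >> bits) & 0xff;
        }
      }
      return out;
    }

    // Walk down `path`; at the end, run the function or recurse into nested enhancements.
    function applyAt(node: unknown, path: Path, depth: number, transform: Transform): unknown {
      if (depth === path.length) {
        return typeof transform === "function" ? transform(node) : enhance(node, transform);
      }
      if (typeof node !== "object" || node === null) return node; // path does not exist
      // Mutating in place: JSON.parse just built this tree and nothing else references it.
      const container = node as Record<string | number, unknown>;
      const key = path[depth];
      container[key] = applyAt(container[key], path, depth + 1, transform);
      return container;
    }

    function enhance(root: unknown, enhancements: Enhancements): unknown {
      let result = root;
      enhancements.forEach(([path, transform]) => {
        result = applyAt(result, path, 0, transform);
      });
      return result;
    }

    // Usage: the JSON string and the path list stand in for what duperToJSON would return.
    const rawObject = JSON.parse('{"name":"logo","files":[{"data":"AQID"}]}');
    const decoded = enhance(rawObject, [[["files", 0, "data"], base64ToBytes]]);
    console.log(decoded);
    ```

    The path list stays a flat array of small tuples, which is the part that should be cheap to hand across the FFI boundary; all object construction is still done by `JSON.parse`.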
aappleby 3 months ago
Where the ** is the grammar specification? Prose is nice, but with a BNF I could plug this into my parsing expression grammar library right quick and give it a rundown.
- epiceric 3 months ago
  Good point. I'll see about making one.
notpushkin 3 months ago
https://xkcd.com/927/
- hshdhdhehd 3 months ago
  The X on the date time support means we need a new standard :)
anilgulecha 3 months ago
The object notation format that's going to win is the one that's going to maximally support LLM output. I've come across BAML before, but it's not widely used for some reason.
Today JSON is winning, but for more complex structures, there are still syntax issues in output. XML does reasonably well (given how much React JSX/HTML is in the training corpus), so perhaps that will make a comeback.
Are there benchmarks on this? I think the SOTA models are fine: they can work with most formats. The interesting question is which output format works best with models that reach 90% of SOTA performance and cost 90% less. This is where the winner will be found.
TLDR: probably JSON or XML will remain the config format for a while.
ACAVJW4H 3 months ago
Nice work, this actually looks great. Of course, it’s only a matter of time before someone drops the XKCD about standards proliferation, so I’ll save them the trouble. Pre-emptive XKCD #927 deployed.
anonzzzies 3 months ago
Why no date and time?
- epiceric 3 months ago
  My reasoning is that they are normally transmitted as strings in JSON, and you could use an identifier like DateTime("2025-11-02T02:33:00Z") if you need to be explicit.
  Making them part of the language would increase the complexity of parsers - how would you validate that a date is actually valid? It's doable (YAML and TOML do it, after all) but requires extra steps.
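  As an illustration of the "extra steps" a parser would take on, here is a minimal sketch of the calendar check for a plain YYYY-MM-DD date. It is not taken from Duper, YAML, or TOML; `isValidDate` is a made-up name, and time-of-day, UTC offsets, and leap seconds are deliberately left out.

  ```typescript
  // Days per month for a non-leap year (index 0 = January).
  const DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];

  // Gregorian rule: every 4th year, except centuries not divisible by 400.
  function isLeapYear(year: number): boolean {
    return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
  }

  // Hypothetical validator: syntax check first, then the calendar check.
  function isValidDate(text: string): boolean {
    const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
    if (match === null) return false;
    const year = Number(match[1]);
    const month = Number(match[2]);
    const day = Number(match[3]);
    if (month < 1 || month > 12) return false;
    const daysInMonth = month === 2 && isLeapYear(year) ? 29 : DAYS_IN_MONTH[month - 1];
    return day >= 1 && day <= daysInMonth;
  }

  console.log(isValidDate("2025-11-02"), isValidDate("2025-02-29"));
  ```

  The grammar alone accepts "2025-02-29"; rejecting it takes the second, semantic pass, which is the added parser complexity in question.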
  - epiceric 3 months ago
    Although given the feedback I've received, date/time might get included into the format.
    jitl 3 months ago
    note that a DateTime w/ a UTC offset is significantly different from a DateTime w/ a TimeZone (+ optional Calendar), aka ZonedDateTime. ZonedDateTime(July 26, 2035 10:15:32pm in Istanbul) may not necessarily always be at today's value of Instant(July 26, 2035 10:15:32pm in Istanbul). If you are going to support date/time, you should not use the word "DateTime", "Date", "Time" in a way that is ambiguous (is it a ZonedDateTime, or an Instant?), or forget to include support for ZonedDateTime.
    MDN page on JavaScript's Temporal library gives a good overview of the difference between the two; today's practice of encoding Instants as ISO 8601 strings in UTC (Z suffix) or at a UTC offset is okay for ephemeral data-in-motion that will be used right now, but is not a good practice for persisted data since time zones, DST rules, etc change all the time. Temporal is the JS-specific API, but these concepts apply to all handling of date/time/etc data in computer systems.
    That said, V8 plans to use [temporal_rs][] as their Temporal backend.
    Temporal: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
    temporal_rs: https://crates.io/crates/temporal_rs
    You can encode extended ZonedDateTime information to string following this RFC [Date and Time on the Internet: Timestamps with Additional Information](https://www.rfc-editor.org/rfc/rfc9557.txt)
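    A small sketch of how such a string could be pulled apart, assuming input shaped like `2035-07-26T22:15:32+03:00[Europe/Istanbul][u-ca=gregory]`. The function and field names are invented for illustration, the RFC 3339 part is passed through without validation, and this is nowhere near a full RFC 9557 parser.

    ```typescript
    interface AnnotatedTimestamp {
      timestamp: string; // the RFC 3339 part, e.g. "2035-07-26T22:15:32+03:00"
      timeZone?: string; // first annotation without "=", e.g. "Europe/Istanbul"
      tags: Record<string, string>; // key=value annotations, e.g. { "u-ca": "gregory" }
    }

    // Hypothetical helper: split a timestamp from its trailing [..] annotations.
    function splitAnnotatedTimestamp(text: string): AnnotatedTimestamp | null {
      const match = /^([^\[\]]+)((?:\[[^\[\]]+\])*)$/.exec(text);
      if (match === null) return null; // unbalanced or misplaced brackets
      const result: AnnotatedTimestamp = { timestamp: match[1], tags: {} };
      // A leading "!" marks an annotation as critical; this sketch just skips it.
      const annotation = /\[!?([^\[\]]+)\]/g;
      let found = annotation.exec(match[2]);
      while (found !== null) {
        const body = found[1];
        const eq = body.indexOf("=");
        if (eq >= 0) {
          result.tags[body.slice(0, eq)] = body.slice(eq + 1);
        } else if (result.timeZone === undefined) {
          result.timeZone = body;
        }
        found = annotation.exec(match[2]);
      }
      return result;
    }

    const zoned = splitAnnotatedTimestamp("2035-07-26T22:15:32+03:00[Europe/Istanbul][u-ca=gregory]");
    console.log(zoned);
    ```

    Keeping the zone name next to the offset is what lets a reader re-resolve the local time if Istanbul's rules change before 2035, instead of trusting a frozen +03:00.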
    djfobbz 3 months ago
    DateTime handling, especially with timezone offsets, is crucial. If your format gets that right, it'll stand out...most formats still mess up time zones or rely on loose string parsing. It's key for stuff like logs, scheduling, or syncing data across systems. DuperGZ right after that! ;)