It has the advantage of being a drop-in replacement in most places where everyone uses v4 today. It also has an advantage over other specs like ULID in that it can be parsed easily even in languages and databases with no libraries, because you just need an obvious substring replace and from_hex to extract the timestamp. Other specs typically used some custom lexically sortable base64 encoding or similar that always needed a library.
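To illustrate the "substr + from_hex" claim, here is a minimal sketch: the first 48 bits of a UUIDv7 are a big-endian Unix timestamp in milliseconds, so stripping the dashes and hex-decoding the first 12 characters is all it takes.

```python
from datetime import datetime, timezone

def uuid7_timestamp(u: str) -> datetime:
    """Extract the embedded timestamp from a UUIDv7 string."""
    hex_digits = u.replace("-", "")   # the "substr replace": drop the dashes
    ms = int(hex_digits[:12], 16)     # first 12 hex chars = 48-bit ms timestamp
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```

The equivalent in SQL is typically a one-liner with the database's own substring and hex-decode functions, which is exactly why no library is needed.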
Early drafts of the spec included a few bits to increment when multiple local IDs were generated in the same millisecond, for sequencing. This was a good fit for lots of use cases, like using the new IDs for events generated in normal client apps. Even though it didn't make the final spec, I think it's worth implementing, as it doesn't break compatibility.
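A sketch of that dropped counter idea (hypothetical, not part of the final spec): reuse the 12-bit field after the version nibble as a per-millisecond sequence counter, so IDs generated in the same millisecond still sort in generation order, while the result remains a valid v7-shaped UUID.

```python
import os
import threading
import time

_lock = threading.Lock()
_last_ms = 0
_seq = 0

def uuid7_with_counter() -> str:
    """v7-shaped UUID whose 12-bit rand_a field is a same-millisecond counter."""
    global _last_ms, _seq
    with _lock:
        ms = time.time_ns() // 1_000_000
        if ms == _last_ms:
            _seq = (_seq + 1) & 0xFFF   # same millisecond: bump the 12-bit counter
        else:
            _last_ms, _seq = ms, 0      # new millisecond: reset it
        seq = _seq
    rand = os.urandom(8).hex()          # random tail for the remaining bits
    h = f"{ms:012x}"
    # version nibble 7, 12-bit counter, variant nibble 8, then random bits
    return f"{h[:8]}-{h[8:12]}-7{seq:03x}-8{rand[:3]}-{rand[3:15]}"
```

Any consumer that only checks the version and variant bits sees an ordinary UUIDv7, which is the compatibility point being made above. (It does assume fewer than 4096 IDs per millisecond per generator.)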
Incrementing a sequence completely kills the purpose of a UUID, and requires serialization/synchronization semantics. If you need that, just use a long integer.
Anyway, as I said, it was dropped from the spec
If you've got a spare 50k kicking around, we could set up a test system and find out how likely it is to happen...
Using a UUID here wouldn’t help, because you don’t want different identifiers for the same content. Time-based UUID versions would negate the point of ETag, and otherwise if you use UUIDv8 and simply put a hash value in there, all you’re doing is reducing the bit depth of the hash and changing its formatting, for limited benefit.
Benefits are readability and a reduced amount of data to be transferred. A UUID is reasonably safe to be unique for the ETag use case (I think 64 bits would actually be enough).
Having the filename be a simple hash of the content guarantees that you don’t make the mistakes above, and makes it trivial to verify.
For example, if my css files are compiled from a build script, and a caching proxy sits in front of my web server, I can set content-hashed files to infinite lifetime on the caching proxy and not worry about invalidating anything. Even if I clean my build output and rebuild, if the resulting css file is identical, it will get the same hash again, automatically. If I used UUIDs and blew away my output folder and rebuilt, suddenly all files have new UUIDs even though their contents are identical, which is wasteful.
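The content-hashed naming described above can be sketched in a few lines; the filename, hash algorithm, and truncation length here are all illustrative choices, not prescribed by anything:

```python
import hashlib

def hashed_name(content: bytes, original: str = "style.css") -> str:
    """Derive a cache-busting filename from the file's bytes.

    Identical content always maps to the same name, so a clean rebuild
    that produces byte-identical output reuses the cached copy.
    """
    digest = hashlib.sha256(content).hexdigest()[:16]  # truncated for readability
    stem, _, ext = original.rpartition(".")
    return f"{stem}.{digest}.{ext}"
```

Because the name is a pure function of the content, "did anything change?" and "is the cached copy still valid?" both reduce to a string comparison.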
For example, hashes are often taken over untrusted data, which could be manipulated to produce a collision.
UUIDs aren't meant to protect against that.
I'm sure RoR just did the straightforward thing, didn't get cute, and called it a day.
Re: monotonicity, as I view it, v7 is the best compromise I can make with devs as a DBRE where the DB isn’t destroyed, and I don’t have to try to make them redesign huge swaths of their app.
Integers are monotonic but can't be distributed like UUIDs.
Unless you make them 128 bits ;)
As usual, most people are not dumb most of the time, even if it seems that way.
They can, to an extent. The use of integers as a primary key has been a solved problem for quite some time, usually by either interleaving distribution among servers, or a coordinator handing chunks out.
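The "coordinator handing chunks out" approach mentioned above can be sketched as a HiLo-style allocator (names here are made up for illustration): each node leases a block of integers and assigns from it locally, only returning to the coordinator when the block is exhausted.

```python
import itertools

class ChunkCoordinator:
    """Hands out disjoint integer ranges to nodes that need to mint IDs."""

    def __init__(self, chunk_size: int = 1000):
        self.chunk_size = chunk_size
        self._starts = itertools.count(0, chunk_size)  # 0, 1000, 2000, ...

    def lease(self):
        start = next(self._starts)
        return iter(range(start, start + self.chunk_size))

coord = ChunkCoordinator()
node_a = coord.lease()   # assigns 0..999 locally
node_b = coord.lease()   # assigns 1000..1999 locally, no overlap possible
```

Each node only talks to the coordinator once per thousand inserts, which is why this was workable long before UUIDs were fashionable.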
If you mean enabling the ability to do joins across physical databases, my counter to that is it’s an unsupported method by any RDBMS, and should be discouraged. You can’t have foreign key constraints across DBs, and without those, I in no way trust the application to consistently do the right thing and maintain referential integrity. I’ve seen too many instances of it going wrong.
The only way I can see it working is something involving Postgres’ FDW, but I’m still not convinced that could maintain atomic updates on its own; maybe with a Pub/Sub in addition? This rapidly gets hideously complicated. Just design a good, normalized schema that can maintain performance at scale. Then when/if it doesn’t, shard with something that handles the logic for you and is tested, like Vitess or Citus.
Or imagine two separate databases that get merged.
DBs can return inserted data to you; Postgres and SQLite can return the entire row, and MySQL can return the last generated auto-increment ID.
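A minimal illustration of that point with SQLite's auto-increment rowid (the schema here is just an example): the database hands back the generated integer key on insert, so there is no second round trip to learn the ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
generated_id = cur.lastrowid   # the key the database just generated
```

Postgres goes further with `INSERT ... RETURNING *`, which hands back the whole row, and MySQL exposes the same idea as `LAST_INSERT_ID()`.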
> Or imagine two separate databases that get merged.
This is sometimes a legitimate need, yes, but it does make me smirk a bit since it goes against the concept of microservices owning their own domain (which I never thought was a great idea for most). However, it’s also quite possible to merge DBs that used integers. Depending on the amount of tables and their relationships (or rather, lack of formally defined ones) it may be challenging, but nothing that can’t be handled.
I mostly just question the desire to dramatically harm DB performance (in the case of UUIDv4) for the sake of undoing future problems more easily.
It was not uncommon for systems to merge datasets, either due to literal M&A or to share records and coordinate care.
A globally unique ID was important, despite not having a globally centralized system.
Assuming you can+want to talk to a database right then.
The useful part of UUIDs is that they can be generated anywhere, locally, remotely, same DB, separate DB, online, offline, and never change.
If you need to perform date/time related operations, use date/time related data types, not an unrelated type that happens to have some arbitrary timestamp embedded in its binary layout.
> Integers are monotonic but can't be distributed like UUIDs.
Yes, use UUIDs if you need distribution, use integers if you need monotonicity. If you need "monotonic and distributed", you need an external authority for proper distribution of those IDs. Then, an integer would still work.
I’ve met more than one architect who hand-waves that fact away during a “leaking integers is bad!” campaign.
Therefore, given a compliant UUIDv7 sample, it is impossible to interpret those bits. You can't say whether they are random or serial without knowing the implementation or doing stochastic analysis of consecutive samples. It's a black box.
The standard would be improved if it just said those bits MUST be uniquely generated for a particular timestamp (e.g. with PRNG or atomic counter).
Logically, that's what it already means, and it opens up interesting v8-style application-specific usages of those bits (like encoding type metadata in a small subset, leaving the rest random), while also complying with the otherwise excellent v7 standard.
ex. I'm loading your documents on startup.
Eventually, we're going to display them as a list on your home screen, newest to oldest.
Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUIDv7 that's in the filename.
Is it perfect? No, ex. we could have a really old doc that's the most recently modified, and the doc ID is a proxy for the creation date.
But it's much better than the status quo of "we're parsing 1000+ docs at ~random at startup, please wait 5 seconds for the list to stop updating over and over."
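The sort described above costs essentially nothing: because the UUIDv7 hex encoding is big-endian and fixed width, a plain lexicographic sort of the filenames is also a chronological sort, with no document parsing or filesystem calls. Filenames here are illustrative:

```python
# Three v7-shaped filenames; the leading hex digits encode the timestamp.
files = [
    "018f6d2e-0000-7abc-8def-000000000001.json",
    "017f6d2e-0000-7abc-8def-000000000002.json",
    "019f6d2e-0000-7abc-8def-000000000003.json",
]

# String sort == time sort for fixed-width big-endian hex.
newest_first = sorted(files, reverse=True)
```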
Your parent says they don't want to wait for the file system:
> Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUIDv7 that's in the filename.
That information comes for free when you're iterating files in a directory. There's no extra waiting than the file name itself because file dates are kept in the same structure that keeps the file names.
Still, I find that justification weird: it relies on the particular binary layout of an ID format. Just use dates in the filenames if you truly need such a mechanism.
You know, like described in the OP.
Is it okay if that's useful?
What is it being relied upon for?
Alternatively, more explicitly, let's look at it from this angle:
Let's follow exactly what you're recommending: parse it from the file.
Then add a fault-tolerant layer in front that parses a UUID-v7 from the filename.
What do you think of that?
Adding layers, etc. Again, it’s a loading UI.
Obviously, we’re talking about a fantasy app here. I’m weighing options based on my understanding of it.
Let's have the fantasy app do exactly as you're recommending.
Now, the fantasy app also happens to store its file using this filename format: {uuid}.json
What objections are there to parsing the uuid from the filename and using it to sort?
Assuming you again point out that the filename might not be a valid UUID:
Is it possible to account for that and fallback to the safe behavior? :)
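The fault-tolerant layer being proposed can be sketched like this: try to read the v7 timestamp out of the filename, and fall back to the slower source (a stub here, standing in for "parse the document") whenever the name isn't a valid UUIDv7.

```python
import uuid

def sort_key(filename: str, slow_fallback=lambda f: 0) -> int:
    """Millisecond timestamp from a {uuid}.json filename, else the fallback."""
    stem = filename.rsplit(".", 1)[0]
    try:
        u = uuid.UUID(stem)
        if u.version == 7:
            return u.int >> 80        # top 48 bits = ms timestamp
    except ValueError:
        pass                          # not a UUID at all
    return slow_fallback(filename)    # the safe behavior: parse the document
```

In the happy path this never touches the file; in the unhappy path it behaves exactly like the parse-the-file approach, which is the whole argument.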
Do you understand? On second read, could be too short and unnecessarily Latin-y. :)
You can do any scheme that suits your individual features needs, and it will be a valid UUID still.
This also means future schemes can be implemented right now without having to get a formal UUID version.
You could use the first few bits to indicate production vs qa vs dev data.
Or a subtle hint about what it might be for (e.g. is this UUID a product identifier, a user identifier, a post, a comment, etc.). Similar to how AWS and others prefix IDs with their type.
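One hypothetical scheme along these lines (the tag values and field placement are invented for illustration): steal the first byte after the version nibble of a v7-shaped UUID for an entity-type tag, leaving the rest to the timestamp and random bits.

```python
import os
import time

TYPE_TAGS = {"user": 0x001, "product": 0x002, "post": 0x003}  # made-up tags

def tagged_uuid7(kind: str) -> str:
    """v7-shaped UUID whose 12-bit rand_a field carries an entity-type tag."""
    ms = time.time_ns() // 1_000_000
    tag = TYPE_TAGS[kind]
    rand = os.urandom(8).hex()
    h = f"{ms:012x}"
    # version nibble 7, 12-bit type tag, variant nibble 8, then random bits
    return f"{h[:8]}-{h[8:12]}-7{tag:03x}-8{rand[:3]}-{rand[3:15]}"

def kind_of(u: str) -> str:
    """Recover the entity type from the tag field."""
    tag = int(u.split("-")[2][1:], 16)
    return {v: k for k, v in TYPE_TAGS.items()}[tag]
```

Anything that only checks version and variant bits still sees a plausible UUIDv7, which is the "valid UUID still" point: the scheme is private to your application.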
The exception is if you're extracting the time portion of a time-based UUID and using it for purposes other than as a unique key, but in my experience this is typically considered bad practice and time is usually stored in a separate column for cases where it matters for business purposes.
They aren't "versions" so much as variants.