A prime example of premature optimization. *Permanent identifiers should not car...

ralferoo · 2025-12-15T17:10:33 1765818633

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"

To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

But in any case, the intention for encoding a timestamp into a UUID isn't for any implied meaning. It's both to guarantee uniqueness and, as a side effect, to make IDs more or less monotonically increasing. Whether this is actually desirable depends on your application, but generally if the application is as an indexed key for insertion into a database, it's usually more useful for performance than a fully random ID as it avoids rewriting lots of leaf nodes of B-trees. If you insert a load of such keys, they form a cluster on one side of the tree that can then rebalance with only the top levels needing to be rewritten.
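To make the ordering property concrete, here is a minimal Python sketch. The helper `uuid7_like` is hypothetical (not a library function); it assumes the RFC 9562 version 7 layout of a 48-bit millisecond timestamp followed by version, variant and random bits.

```python
import os
import time
import uuid
from typing import Optional

TS_MASK = (1 << 48) - 1


def uuid7_like(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Hypothetical helper: pack a millisecond timestamp plus random bits
    into the version 7 layout (48-bit time, version, variant, random)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    value = (unix_ms & TS_MASK) << 80                 # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")    # 80 random low bits
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # variant bits = 0b10
    return uuid.UUID(int=value)


# IDs minted at increasing times compare in creation order, so a B-tree
# index keeps appending near its right-hand edge.
time_ordered = [uuid7_like(1_700_000_000_000 + i) for i in range(1000)]
print(time_ordered == sorted(time_ordered))

# Fully random IDs land all over the key space instead.
random_ids = [uuid.uuid4() for _ in range(1000)]
print(random_ids == sorted(random_ids))
```

Two IDs minted in the same millisecond are ordered only by their random bits, which is why the ordering is "more or less" monotonic.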

nonethewiser · 2025-12-15T19:19:26 1765826366

>To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

You still have that problem from organic birthdays and also the problem of needing to change ids to correct birth dates.

jonny_eh · 2025-12-15T21:45:02 1765835102

To add to that, birthdays can clump, just like any seemingly "random" data.

Dylan16807 · 2025-12-15T23:15:10 1765840510

Not significantly. For actual births, a couple of holidays have very low rates, but no dates clump into much higher rates.

lovich · 2025-12-15T22:30:58 1765837858

A million dots scattered randomly over a graph can all land on the exact same coordinate if it’s truly random.

What most people intuit as random is some sort of noise function that is generally dispersed and doesn’t trigger the pattern matching part of their brain

Dylan16807 · 2025-12-15T23:18:51 1765840731

> A million dots scattered randomly over a graph can all land on the exact same coordinate if it’s truly random.

It won't happen though. 0.00000000% chance it happens even once in a trillion attempts.

> What most people intuit as random is some sort of noise function that is generally dispersed and doesn’t trigger the pattern matching part of their brain

Yes, people intuit the texture of random wrong in a situation where most buckets are empty. But when you have orders of magnitude more events than buckets, that effect doesn't apply. You get pretty even results that people expect.
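A quick simulation of that last point, as a rough sketch: it assumes uniformly random events (real births are not exactly uniform) and uses 365 buckets and a million events purely as illustrative numbers.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so the run is repeatable
BUCKETS = 365            # e.g. days of the year
EVENTS = 1_000_000       # orders of magnitude more events than buckets

counts = Counter(rng.choices(range(BUCKETS), k=EVENTS))
fullest = max(counts.values())
emptiest = min(counts[b] for b in range(BUCKETS))
ratio = fullest / emptiest

print(f"fullest bucket: {fullest}, emptiest: {emptiest}, ratio: {ratio:.3f}")
```

With about 2,740 events expected per bucket, the fullest and emptiest buckets should differ by only a few percent either side of the mean.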

lovich · 2025-12-16T00:24:34 1765844674

> It won't happen though. 0.00000000% chance it happens even once in a trillion attempts.

It has the same odds as any other specific configuration of randomly assigned dots. The overly active human pattern matching behavior is the only reason it would be treated as special.

coldtea · 2025-12-16T01:30:08 1765848608

>It has the same odds as any other specific configuration of randomly assigned dots

Which doesn't change anything in practice, since it having "the same odds as any other specific configuration" ignores the fact that more scattered configurations are still far more numerous than it (or even from ones with more visual order in general) taken all together.

>The overly active human pattern matching behavior is the only reason it would be treated as special.

Nope, it's also the fact that it is ONE configuration, whereas all the rest are a much, much larger number. That's enough to make this specific configuration ultra rare in comparison (since we don't compare it to each other but to all others put together).

FartinMowler · 2025-12-16T02:57:53 1765853873

Lol, reminds me of a story: at his workplace my brother was invited to join a lottery ticket pool where each got to pick the numbers for a ticket. The numbers he picked were 1-2-3-4-5-6. Although the others, mostly fellow engineers, reluctantly agreed his numbers were as likely as the others, after a couple of weeks they neglected to invite him again.

Dylan16807 · 2025-12-16T01:13:19 1765847599

Entropy says it's special. If you have a million dots and 10,000 coordinates, you have 10,000 ways for all the dots to land in the same coordinate, and a zillion kavillion stupillion ways to have somewhere near 100 dots in each coordinate.
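The counting can be made concrete with logarithms. This is a sketch under the numbers above (a million dots, 10,000 coordinates); it counts only the perfectly even case of exactly 100 dots per coordinate, which is a lower bound on "somewhere near 100 dots in each".

```python
import math

DOTS = 1_000_000
CELLS = 10_000


def log10_factorial(n: int) -> float:
    # lgamma(n + 1) is ln(n!), converted here to base 10.
    return math.lgamma(n + 1) / math.log(10)


# Every way of assigning each dot to a coordinate: CELLS ** DOTS.
log10_total = DOTS * math.log10(CELLS)

# All dots in one coordinate: one configuration per coordinate.
log10_all_same = math.log10(CELLS)

# Exactly DOTS / CELLS dots in every coordinate: a multinomial coefficient.
log10_even = log10_factorial(DOTS) - CELLS * log10_factorial(DOTS // CELLS)

print(f"log10(all configurations)        = {log10_total:,.0f}")
print(f"log10(all dots in one place)     = {log10_all_same:,.0f}")
print(f"log10(exactly even distribution) = {log10_even:,.0f}")
```

Each individual configuration is equally likely, but the even category contains vastly more of them than the all-in-one-place category.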

lovich · 2025-12-16T02:48:43 1765853323

No, if it's randomly distributed then every specific configuration has the same exact chance of happening.

I am laughing at all the people coming out of the woodwork to reply to my original post in this thread misunderstanding randomness and chance.

If you flip a coin a million times and it lands on heads every single time, the millionth and 1 time still has a 50/50 chance of landing on heads

Dylan16807 · 2025-12-16T03:10:30 1765854630

> every specific configuration

Who said anything about specific configurations?

We started this talking about whether things "clump" or not. The result depends on your definition of "clump" but let's say it involves a standard deviation. Different standard deviations have wildly different probabilities, even when every specific configuration has the same probability.

Nobody responding to you is calculating things wrong. We're talking about the shape of the data. Categories. And those categories are different sizes, because they have different numbers of specific configurations in them.

> the millionth and 1 time

I don't see any connection between the above discussion and the gambler's fallacy?

tyre · 2025-12-15T17:22:58 1765819378

And then have to enter/handle a non-date through all systems? How do you know if this non-dated person is over the age of minority? Eligible for a pension?

Maybe the answer is to evenly spread the defaults over 365 days.

ralferoo · 2025-12-15T17:37:15 1765820235

If you don't know their birthday, you can presumably never answer that question in any case.

If you only know the birth year and keyed 99 as the month for unknown, then your algorithm would determine they were of a correct age on the start of the year after that was true, which I guess would be what you want for legal compliance.
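As a sketch of how that falls out of a plain comparison (the sentinel value 99 and the function name are assumptions for illustration, not any real registry's rules):

```python
from datetime import date

UNKNOWN = 99  # assumed sentinel for an unknown day or month


def is_at_least(birth_year: int, birth_month: int, birth_day: int,
                age: int, today: date) -> bool:
    """True once the person has certainly reached `age`.

    Tuples compare field by field, so an unknown month (99) sorts after
    December and an unknown day (99) sorts after any real day.
    """
    threshold = (birth_year + age, birth_month, birth_day)
    return threshold <= (today.year, today.month, today.day)


# Only the birth year 2000 is known.
print(is_at_least(2000, UNKNOWN, UNKNOWN, 18, date(2018, 12, 31)))  # False
print(is_at_least(2000, UNKNOWN, UNKNOWN, 18, date(2019, 1, 1)))    # True
```

The same comparison handles a known month with an unknown day: the person counts as of age from the first day of the following month.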

If you don't even know if the birth year is correct, then the correct process depends on policy. Maybe they choose any year, maybe they choose the oldest/youngest year they might be, maybe they just encode that as 0000/9999.

Again, if you don't know the birth year of someone, you would have no way of knowing their age. I'm not sure that means that the general policy of putting a birthday into their ID number is flawed.

Many governments re-issue national IDs to the same person with different numbers, which is far less problematic than the many governments who choose to issue the same national ID (looking at you USA with your SSN) to multiple individuals. It doesn't seem like a massive imposition on a person who was originally issued an ID based on not knowing their birthday to be re-issued a new ID when their birthday was ascertained. Perhaps even give them a choice of keeping the old one knowing it will cause problems, or take the new one instead and having the responsibility to tell people their number had changed.

Presumably the governments that choose to embed the date into a national ID number do so because it's more useful for their purposes to do so than just assigning everyone a random number.

notpushkin · 2025-12-16T01:23:58 1765848238

> or take the new one instead and having the responsibility to tell people their number had changed

Or have the opportunity to scam people into thinking you’re a different person. (E.g. take a $1M loan, go bankrupt, remember your birthday, and take a loan again.)

PunchyHamster · 2025-12-15T23:12:43 1765840363

> To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

well, till you run out of numbers for the immigrants that don't have exact birth date

OptionOfT · 2025-12-15T18:58:54 1765825134

Belgium's national register number is similar:

YY.MM.DD-AAA.BB

In either the AAA or BB component there is something about the gender.

But it does mean that there is a limit of people born per day of a certain gender.

But for a given year, using a moniker will only delay the inevitable. Sure, there are more numbers, but still limited as there are SOME parts that need to reflect reality. Year, gender (if that's still the case?) etc.

hyperman1 · 2025-12-15T20:21:14 1765830074

BB is a mod-97 checksum. The first A of AAA encodes your gender in an even/odd fashion, I forgot if it's the first or last A doing that. MM or DD can be 00 if unknown. Also MM has +20 or +40 in some cases.

If you know someone's birth date and gender, the INSZ is almost certainly 1 in 500 numbers, with a heavy skew to the lower AAA. Luckily, you can't do much damage with someone's number, unlike a US SSN (but I'd still treat it confidential).
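For the curious, a rough Python sketch of the mod-97 check. It assumes the commonly described rule: BB is 97 minus the remainder of the nine digits YYMMDDAAA divided by 97, with a "2" prepended for people born in 2000 or later. As far as I know the gender parity is that of the AAA serial as a whole (odd for men, even for women), but treat that as an assumption too. The function names are made up for illustration.

```python
def check_digits(yymmdd: str, serial: str, born_2000_or_later: bool = False) -> str:
    """Compute BB for YY.MM.DD-AAA.BB under the assumed mod-97 rule."""
    base = yymmdd + serial              # nine digits: YYMMDD + AAA
    if born_2000_or_later:
        base = "2" + base               # assumed century marker
    return f"{97 - int(base) % 97:02d}"


def is_valid(number: str) -> bool:
    """Accept a number if BB matches under either century assumption."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 11:
        return False
    yymmdd, serial, check = digits[:6], digits[6:9], digits[9:]
    return check in (check_digits(yymmdd, serial, False),
                     check_digits(yymmdd, serial, True))


print(check_digits("850730", "033"))    # 28
print(is_valid("85.07.30-033.28"))      # True
print(is_valid("85.07.30-033.29"))      # False
```

Note that the check only catches typos; it does nothing to keep the number secret, since anyone can recompute BB.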

notpushkin · 2025-12-16T01:52:43 1765849963

> I'd still treat it confidential

Estonian isikukood is GYYMMDDNNNC, and is relatively public. You can find mine pretty easily if you know where to look (no spoilers!). It’s relatively harmless.

Kazakh IIN is YYMMDDNNNNNN (where N might have some structure) and is similarly relatively public: e.g. if you’re a sole proprietor, chances are you have to hang your license on the wall, which will have it.

It’s a bit more serious: I’ve got my mail at the post office by just showing a barcode of my IIN to the worker. They usually scan it from an ID, which I don’t have, but I’ve figured out the format and created a .pkpass of my own. Zero questions – here’s your package, no we don’t need your passport either, have a nice day!

(Tangential, but Kazakhs also happen to have the most peculiar post office layout: it looks exactly like a supermarket, where you go in, find your packages (sorted by the tracking number, IIRC), and go to checkout. I’ve never seen it anywhere else.)

croes · 2025-12-15T19:42:06 1765827726

> If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits.

You can already feel the disaster rising because some program always expects the latter.

And it doesn’t fix the problem, it just makes it less likely.

tacone · 2025-12-15T12:23:43 1765801423

Fantastic real life example. Italian PNs also carry the gender, which is something you can change surgically, and you'll eventually run into the issue when operating at scale.

I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. There are situations where you want to have a way to reconcile, you have space or speed constraints, so you may accept the trade off, md5 your data and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.
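A minimal sketch of the "md5 your data and store it as a UUID" trade-off in Python. The helper name `content_uuid` is made up; the standard library's `uuid.uuid3` does something similar but also mixes in a namespace.

```python
import hashlib
import uuid


def content_uuid(data: bytes) -> uuid.UUID:
    """Derive a deterministic UUID from the content itself."""
    digest = hashlib.md5(data).digest()        # 16 bytes, the same width as a UUID
    # version=3 stamps the version and variant bits for a name-based MD5 UUID.
    return uuid.UUID(bytes=digest, version=3)


a = content_uuid(b"alice@example.com|1985-07-30")
b = content_uuid(b"alice@example.com|1985-07-30")
c = content_uuid(b"bob@example.com|1990-01-01")

print(a == b)   # True: the same data always reconciles to the same key
print(a == c)   # False
```

The keys are effectively random with respect to insertion order, which is where the index fragmentation comes from.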

mckirk · 2025-12-15T12:29:21 1765801761

I'm not sure whether that was intended, but 'operating at scale' actually made me laugh out loud :D

benterix · 2025-12-15T17:25:49 1765819549

I have to admit an unintended chuckle, too.

cozyman · 2025-12-15T16:26:43 1765816003

how does one change their gender surgically?

delichon · 2025-12-15T17:34:39 1765820079

You can't, but since gender isn't defined by anything physical, there's no need.

bigstrat2003 · 2025-12-15T18:38:21 1765823901

That is only true if you're using an extremely idiosyncratic definition of gender. As far as 95% of English speakers are concerned, gender is defined by the body you possess.

defrost · 2025-12-15T22:41:42 1765838502

As far as nigh on 100% of Bugis speakers are concerned there has always been five genders and they'll tell you the words in their language they have for them.

* https://en.wikipedia.org/wiki/Buginese_language

It appears to be a cultural construct.

EnergyAmy · 2025-12-16T04:26:05 1765859165

You and the other person are probably talking past each other. For most people, "gender" is merely the polite way of saying "sex", and that's probably what the other commenter was referring to.

Gender in the sense of "the social roles and norms on top of biological sex" is indeed a construct, though heavily informed by the biology that they're based on. Biological sex is very much real and not a construct.

defrost · 2025-12-16T05:09:04 1765861744

Of course biological sex is real and strongly bimodal with outliers, who ever said otherwise?

EnergyAmy · 2025-12-16T06:06:56 1765865216

Technically correct, but to be specific sex is binary, not merely bimodal. Sex is entirely defined by gametes, and is binary in anisogamous species such as humans. Isogamous species don't have sexes, they have mating types (and often many thousands of them).

There's actually an ideological movement to try to redefine sex based on sex traits instead of gametes, but this ends up being incoherent and useless for the field of biology. Biologists have had to publish papers explaining the fundamentals of their field to counter the ideological narrative:

Why There Are Exactly Two Sexes

https://link.springer.com/article/10.1007/s10508-025-03348-3

That's why I thought it was worth mentioning. Many people are confused because of the culture wars. To bring it back around to the general topic of this thread, it's fine to store someone's sex as a boolean, because sex is binary and immutable. Storing cultural constructs like gender as anything other than an arbitrary string is asking for trouble, though.

defrost · 2025-12-16T12:03:56 1765886636

Reproductive sex is determined by gametes .. sure.

Not all humans are born with the attribute of reproductive sex via gametes.

Hence "biological sex is real and strongly bimodal with outliers" (in humans, it gets odder elsewhere in animal life on earth) it's just not all reproductive sex, nor is all just strictly M or strictly F despite it mostly being one or the other.

> To bring it back around to the general topic of this thread, it's fine to store someone's sex as a boolean, because sex is binary and immutable.

Not in Australia, via a decision that ascended through all levels of the national court system, nor is sex, as you've chosen to define it ("entirely defined by gametes") binary.

Biology is truly messy. It's understandable not everybody truly grasps this.

Colin Wright is pretty much a prop up cardboard "scientist" for the Manhattan Institute (a political conservative think tank).

I tend to run with people with actual field credentials doing real biology and medicine; Michael Alpers, Fiona Stanley, Fiona Wood, et al were my influences.

If Colin Wright scratches your itch for bad biology then by all means run with the one hit wonder who reinforces a preconception untroubled by empiricism.

EnergyAmy · 2025-12-16T14:15:36 1765894536

You can't legislate reality away. If you're tracking biological sex, then it doesn't matter what a court decides. If you're tracking legal fictions then you might.

I look forward to your citation disputing the truth of what he lays out in that paper. In the meantime, feel free to peruse the list here of people affirming the same stance:

https://projectnettie.wordpress.com/

Or someone else:

https://www.nas.org/academic-questions/33/2/in-humans-sex-is...

You should ask the people you run with why no human is born with a body not organized around the production of gametes. You'll notice that when you read about conditions like anorchia or ovarian agenesis, the sex of the person with that condition is not a mystery, it's literally in the name.

Biology is messy indeed, and that's why finding such a universal definition was so useful.

Terr_ · 2025-12-15T18:59:18 1765825158

Does that mean hundreds of years of English-speakers referring to sailing ship as "she" were all part of a conspiracy to hide that ships have jiggly bits? :p

pezezin · 2025-12-16T11:40:42 1765885242

Wait until you find gendered languages (like most languages in Europe) and realize that grammatical gender usually doesn't have anything to do with biological sex :P

lovich · 2025-12-15T22:34:05 1765838045

The only real states of matter are solids, liquids, and gases. Everything else is just woke lunacy.

I am confident in this fact because I learned it in elementary school decades ago and it is impossible for humanity to discover new information that updates our world model. Every English speaker knows that “plasmas” and “Bose-Einstein condensates” are made up.

brigandish · 2025-12-16T01:12:56 1765847576

We all await your Nobel for finding a third type of gamete.

lovich · 2025-12-16T02:44:15 1765853055

The person I was responding to was talking about gender, but if you want to talk about biology then

https://en.wikipedia.org/wiki/Intersex#Prevalence

https://en.wikipedia.org/wiki/Klinefelter_syndrome

https://en.wikipedia.org/wiki/XXYY_syndrome

https://en.wikipedia.org/wiki/XXXY_syndrome

https://en.wikipedia.org/wiki/XXXYY_syndrome

https://en.wikipedia.org/wiki/XXXXY_syndrome

https://en.wikipedia.org/wiki/Trisomy_X

I assume you will be one of the advocates for my nobel prize

edit: I'm sorry, you specifically mentioned gametes. We can talk about diploids and haploids if you wish and how our bodies are such complicated machines that any sort of error that can occur in our growth is guaranteed to occur at scale

EnergyAmy · 2025-12-16T04:05:04 1765857904

XXY/etc are all variations within a sex. The above poster is correct to point out that sex is defined entirely by the gamete size that one's body is organized around producing in anisogamous species like humans, and is binary.

Intersex is a misleading term, the better term is https://en.wikipedia.org/wiki/Disorders_of_sex_development. There are male DSDs and female DSDs. Even in the case of ovotestes, you'll have one gamete produced, and the other tissue will be nonfunctional.

lovich · 2025-12-16T04:36:21 1765859781

And yet, the original person I was responding to spoke about gender.

If you are going to step into this argument, please do not move the goalposts

edit: I've triggered the HN censor bot, so editing to apologize to EnergyAmy, they are correct on their point. I am still going to throw back at brigandish that they moved the goalposts

EnergyAmy · 2025-12-16T04:41:39 1765860099

I'm responding specifically to your comment in regards to "but if you want to talk about biology then" followed by a list of biological variations that don't dispute the sex binary. The goalposts are exactly where you've left them.

brigandish · 2025-12-16T05:18:00 1765862280

Not only have you undermined your claim to a Nobel award by showing a spurious understanding of biology, you wrote, quite sarcastically "it is impossible for humanity to discover new information that updates our world model". Well then, we will all await your discovery of that 3rd gamete, or some theory so innovative that it tips this well studied, well understood, uncontested (by any valid competitor) model to the wayside and humanity can revel in this new information, the better model of reality that you promise.

While you're at it, you could tell us all what the scientific discovery was that made gender separate from sex, who found it and when, and what the defining difference is. Did they win a Nobel for that?

I request that in any reply, you refrain from spamming me with Wikipedia links to articles you don't understand and probably haven't read.

WalterSlovotsky · 2025-12-15T17:11:38 1765818698

The preferred method would be gender affirming surgery.

cozyman · 2025-12-15T17:40:21 1765820421

affirming would mean the change has already taken place

BobaFloutist · 2025-12-15T20:12:40 1765829560

Right, because it has. The change in gender identity (or in choosing to make said identity more public) has already taken place, and the surgery seems to affirm that.

barrkel · 2025-12-15T13:25:34 1765805134

Uuid v7 just has a bias in its generation; it isn't carrying information. You're not going to try and extract a timestamp from a uuid.

Random vs time biased uuids are not a decision to shave off ms that you will regret.

Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.

duckerude · 2025-12-15T17:20:42 1765819242

I've worked on a system where ULIDs (not UUIDv7, but similar) were used with a cursor to fetch data in chronological order and then—surprise!—one day records had to be backdated, meaning that either the IDs for those records had to be counterfeited (potentially violating invariants elsewhere) or the fetching had to be made smarter.

You can choose to never make use of that property. But it's tempting.
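A toy Python sketch of the failure mode, using plain integers as stand-ins for time-ordered IDs (the store and function names are invented for illustration):

```python
# Stand-in for a table keyed by a time-ordered ID (larger means later).
store = {}


def fetch_after(cursor: int) -> list:
    """Return the IDs a cursor-based reader would see next, in ID order."""
    return sorted(record_id for record_id in store if record_id > cursor)


store[1000] = "first event"
store[2000] = "second event"

seen = fetch_after(0)        # the reader catches up
cursor = seen[-1]            # and remembers the last ID it saw

store[1500] = "backdated"    # a record whose ID encodes an earlier time

missed = fetch_after(cursor)
print(seen)     # [1000, 2000]
print(missed)   # []: the backdated record sits behind the cursor
```

The reader never sees record 1500 unless the ID is minted with a later time or the fetch is keyed on something other than the ID.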

voidnap · 2025-12-15T19:07:54 1765825674

I made a service using something like a 64 bit wide ULID but there was never a presumption that data would be inserted or updated earlier than the most recent record.

If the domain is modeling something like external events (in my case), and that external timestamp is packed into your primary key, and you support receiving events out of chronological order, then it just follows that you might insert stuff earlier than your latest record.

You're gonna have problems "backdating" if you mix up time of insertion with when the event you model actually occurred. Like if you treat those as the same thing when they aren't.

tobyhinloopen · 2025-12-15T15:44:00 1765813440

> You're not going to try and extract a timestamp from a uuid.

I totally used uuidv7s as "inserted at" in a small project and I had methods to find records created between two timestamps that literally converted timestamps to uuidv7 values so I could do "WHERE id BETWEEN a AND b"
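Roughly what that conversion can look like in Python, as a sketch: it assumes the database compares UUIDs as big-endian 128-bit values, and the function name is hypothetical. The two bounds are not themselves valid version 7 UUIDs; they only bracket every UUIDv7 whose timestamp falls in the range.

```python
import uuid
from datetime import datetime, timezone


def uuid7_bounds(start: datetime, end: datetime):
    """Lowest and highest 128-bit values whose top 48 bits fall in [start, end]."""
    lo_ms = int(start.timestamp() * 1000)
    hi_ms = int(end.timestamp() * 1000)
    lowest = uuid.UUID(int=lo_ms << 80)                        # all trailing bits 0
    highest = uuid.UUID(int=(hi_ms << 80) | ((1 << 80) - 1))   # all trailing bits 1
    return lowest, highest


start = datetime(2025, 12, 15, tzinfo=timezone.utc)
end = datetime(2025, 12, 16, tzinfo=timezone.utc)
a, b = uuid7_bounds(start, end)
print(f"WHERE id BETWEEN '{a}' AND '{b}'")
```

In real code the two values would be passed as query parameters rather than formatted into the SQL string.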

jandrewrogers · 2025-12-15T16:13:47 1765815227

> You're not going to try and extract a timestamp from a uuid.

Hyrum's Law suggests that someone will.

ncruces · 2025-12-15T21:51:10 1765835470

> You're not going to try and extract a timestamp from a uuid.

So, random library: https://pkg.go.dev/github.com/google/uuid#UUID.Time

> Time returns the time in 100s of nanoseconds since 15 Oct 1582 encoded in uuid. The time is only defined for version 1, 2, 6 and 7 UUIDs.

bri3d · 2025-12-15T13:46:46 1765806406

> You're not going to try and extract a timestamp from a uuid.

What? The first 48 bits of a UUIDv7 are a UNIX timestamp.
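Pulling it back out is a single shift. A short Python sketch (the helper name is made up), using the version 7 example value from RFC 9562:

```python
import uuid
from datetime import datetime, timezone


def uuid7_time(u: uuid.UUID) -> datetime:
    """Read the millisecond UNIX timestamp stored in the top 48 bits."""
    if u.version != 7:
        raise ValueError("not a version 7 UUID")
    unix_ms = u.int >> 80
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


example = uuid.UUID("017f22e2-79b0-7cc3-98c4-dc0c0c07398f")
print(uuid7_time(example))   # 2022-02-22 19:22:22+00:00
```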

Whether or not this is a meaningful problem or a benefit to any particular use of UUIDs requires thinking about it; in some cases it’s not to be taken lightly and in others it doesn’t matter at all.

I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).

nine_k · 2025-12-15T16:34:33 1765816473

Nobody forces you to use a real Unix timestamp. BTW the original Unix timestamp is 32 bits (expiring in 2038), and now everyone is switching to 64-bit time_t. What 48 bits?

All you need is a guaranteed non-decreasing 48-bit number. A clock is one way to generate it, but I don't see why a UUIDv7 would become invalid if your clock is biased, runs too fast, too slow, or whatever. I would not count on the first 48 bits being a "real" timestamp.

bri3d · 2025-12-15T17:12:17 1765818737

> Nobody forces you to use a real Unix timestamp.

Besides the UUIDv7 specification, that is? Otherwise you have some arbitrary kind of UUID.

> I would not count on the first 48 bits being a "real" timestamp.

I agree; this is the existential hazard under discussion which comes from encoding something that might or might not be data into an opaque identifier.

I personally don't agree as dogmatically with the grandparent post that extraneous data should _not_ be incorporated into primary key identifiers, but I also disagree that "just use UUIDv7 and treat UUIDs as opaque" is a completely plausible solution either.

sroussey · 2025-12-15T22:02:17 1765836137

That is like the HTML specification -- nobody ever puts up a web page that is not conformant. ;p

The idea behind putting some time as prefix was for btree efficiency, but lots of people use client side generation and you can't trust it, and it should not matter because it is just an id not a way of registering time.

nine_k · 2025-12-15T19:14:45 1765826085

I mean, any 32-bit unsigned integer is a valid Unix timestamp up until 19 January 2038, and, by extension, any u64 is, too, for far longer time.

The only promise of Unix timestamps is that they never go back, always increase. This is a property of a sequence of UUIDs, not any particular instance. At most, one might argue that an "utterly valid" UUIDv7 should not contain a timestamp from far future. But I don't see why it can't be any time in the past, as long as the timestamp part does not decrease.

The timestamp aspect may be a part of an additional interface agreement: e.g. "we guarantee that this value is UUIDv7 with the timestamp in UTC, no more than a second off". But I assume that most sane engineers won't offer such a guarantee. The useful guarantee is the non-decreasing nature of the prefix, which allows for sorting.
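A sketch of that weaker guarantee in Python: a prefix generator that only promises not to go backwards within one process, whatever the clock does. The class name and the injectable clock are assumptions for illustration.

```python
import time

MASK48 = (1 << 48) - 1


class NonDecreasingPrefix:
    """Hypothetical generator for a 48-bit prefix that never decreases."""

    def __init__(self, clock=lambda: time.time_ns() // 1_000_000):
        self._clock = clock
        self._last = 0

    def next(self) -> int:
        now = self._clock() & MASK48
        # Never hand out a smaller value than before, even if the clock steps back.
        self._last = max(self._last, now)
        return self._last


# A deliberately misbehaving clock: it jumps backwards on the third reading.
readings = iter([1000, 1005, 990, 1010])
gen = NonDecreasingPrefix(clock=lambda: next(readings))
prefixes = [gen.next() for _ in range(4)]
print(prefixes)   # [1000, 1005, 1005, 1010]
```

This says nothing about IDs generated on other machines, which is where client-side generation makes the ordering only approximate.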

hnfong · 2025-12-15T12:56:01 1765803361

The curious thing about the article is that it's definitely premature optimization for smaller databases, but when the database gets to the scale where these optimizations start to matter, you actually don't want to do what they suggest.

Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (eg. to the extent primary keys can't fit within 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUID works better than auto-incrementing ints.

scottlamb · 2025-12-15T20:56:56 1765832216

I agree there's a scale below which this (or any) optimization doesn't matter and a scale above which you want your primary key to have locality (in terms of which shard/tablet/... is responsible for the record). But...

* I think there is a wide range in the middle where your database can fit on one machine if you do it well, but it's worth optimizing to use a cheaper machine and/or extend the time until you need to switch to a distributed db. You might hit this middle range soon enough (and/or it might be a painful enough transition) that it's worth thinking about it ahead of time.

* If/when you do switch to a distributed database, you don't always need to rekey everything:

** You can spread existing keys across shards via hashing on lookup or reversing bits. Some databases (e.g. DynamoDB) actually force this.

** Allocating new ids in the old way could be a big problem, but there are ways out. You might be able to switch allocation schemes entirely without clients noticing if your external keys are sufficiently opaque. If you went with UUIDv7 (which addresses some but not all of the article's points), you can just keep using it. If you want to keep using dense(-ish), (mostly-)sequential bigints, you can amortize the latency by reserving blocks at a time.
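To illustrate the "hashing on lookup or reversing bits" option, a small Python sketch. It assumes 64-bit integer keys; the function names and shard counts are made up for the example.

```python
import hashlib


def reverse_bits64(key: int) -> int:
    """Mirror the 64 bits of a key so sequential keys differ in their top bits."""
    return int(f"{key:064b}"[::-1], 2)


def shard_by_reversed_bits(key: int, shard_bits: int) -> int:
    """Pick a shard from the top bits of the bit-reversed key."""
    return reverse_bits64(key) >> (64 - shard_bits)


def shard_by_hash(key: int, shards: int) -> int:
    """Pick a shard from a hash of the key instead of the key itself."""
    digest = hashlib.sha256(key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % shards


# Eight consecutive keys land on eight different shards.
print(sorted(shard_by_reversed_bits(k, 3) for k in range(8)))
print([shard_by_hash(k, 8) for k in range(8)])
```

Either way the stored key stays the same dense integer; only the placement function changes, so existing rows do not need to be rekeyed.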

mkleczek · 2025-12-15T12:31:36 1765801896

This is actually a very deep and interesting topic. Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place. So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.

The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

Ukv · 2025-12-15T15:30:26 1765812626

> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.

The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.

> So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.

Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.
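As a small illustration with SQLite from Python (the table and column names are invented): the primary key is a random UUID, the email has its own index, and changing the email never touches the key.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL, phone TEXT)"
)
# A secondary index: lookups by email do not need the primary key.
conn.execute("CREATE UNIQUE INDEX users_email ON users (email)")

user_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?)", (user_id, "old@example.com", "555-0100")
)

# The real-world attribute changes; the identifier stays put.
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("new@example.com", user_id))

row = conn.execute(
    "SELECT id FROM users WHERE email = ?", ("new@example.com",)
).fetchone()
print(row[0] == user_id)   # True
```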

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

A fully random primary key takes into account that things change - since it's not embedding any real-world information. That said I also don't think there's much issue with embedding creation time in the UUID for performance reasons, as the article is suggesting.

marcus_holmes · 2025-12-16T04:53:17 1765860797

> You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.

This is the key point, I think. Searching is not the same as identifying.

mkleczek · 2025-12-15T17:00:55 1765818055

> Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.

So what is such an identifier for? Is it only for some technical purposes (like replication etc.)?

Why bother with UUID at all then for internal identifiers? Sequence number should be enough.

sethhochberg · 2025-12-15T17:35:39 1765820139

"Internal" is a blurry boundary, though - you pick integer sequence numbers and then years on an API gets bolted on to your purely internal database and now your system is vulnerable to enumeration attacks. Does a vendor system where you reference some of your internal data count as "internal"? Is UID 1 the system user that was originally used to provision the system? Better try and attack that one specifically... the list goes on.

UUIDs or other similarly randomized IDs are useful because they don't include any ordering information or imply anything about significance, which is a very safe default despite the performance hits.

There certainly are reasons to avoid them and the article we're commenting on names some good ones, at scale. But I'd argue that if you have those problems you likely have the resources and experience to mitigate the risks, and that true randomly-derived IDs are a safer default for most new systems if you don't have one of the very specific reasons to avoid them.

mkleczek · 2025-12-15T18:05:47 1765821947

> "Internal" is a blurry boundary, though

Not for me :)

"Internal" means "not exposed outside the database" (that includes applications and any other external systems)

demurgos · 2025-12-15T18:39:12 1765823952

Internal means "not exposed outside some boundary". For most people, this boundary encompasses something larger than a single database, and this boundary can change.

Ukv · 2025-12-15T17:40:48 1765820448

UUIDs are good for creating entries concurrently where coordinating between distributed systems may be difficult.

May also be that you don't want to leak information like how many orders are being made, as could be inferred from a `/fetch_order?id=123` API with sequential IDs.

Sequential primary keys are still commonly used though - it's a scenario-dependent trade-off.
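A rough sketch of the leak, assuming an outsider can observe two sequential order IDs a day apart. The ID values and the `orders_between` helper are made up for illustration.

```python
import uuid


def orders_between(first_seen_id: int, later_seen_id: int) -> int:
    """Number of IDs handed out between two observations of a sequential key."""
    return later_seen_id - first_seen_id


# Hypothetical: the observer places one order on Monday and one on Tuesday
# and simply reads the IDs out of the /fetch_order?id=... URLs.
monday_id, tuesday_id = 123, 1_873
leaked_volume = orders_between(monday_id, tuesday_id)
print(f"roughly {leaked_volume} orders were created in between")

# With random UUIDv4 keys the same trick yields nothing useful:
# the two values carry no ordering or count information.
a, b = uuid.uuid4(), uuid.uuid4()
print(a, b)
```

The subtraction only works because the key is a counter; that is the whole leak.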

mkleczek · 2025-12-15T18:02:28 1765821748

If you expose the identifier outside the database, it is no longer "internal".

Ukv · 2025-12-15T18:51:58 1765824718

Given the chain was:

> > Using a random UUID as primary key does not mean users have to memorize that UUID. [...]

> So what is such an identifier for? [...] Why bother with UUID at all then for internal identifiers?

The context, that you're questioning what they're useful for if not for use by the user, suggests that "internal" means the complement. That is, IDs used by your company and software, and maybe even API calls the website makes, but not anything the user has to know.

Otherwise, if "internal" was intended to mean something stricter (only used by a single non-distributed database, not accessed by any applications using the database, and never will be in the future), then my response is just that many IDs are neither internal in this sense nor intended to be memorized/saved by the user.

everforward · 2025-12-15T15:50:09 1765813809

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
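A minimal sketch of that indirection using an in-memory SQLite database. The `users` and `external_ids` tables, their columns, and the sample values are assumptions made up for illustration, not anything from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Internal, context-less identifier: carries no real-world data.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Any number of external identifiers can resolve to one internal ID.
    CREATE TABLE external_ids (
        kind    TEXT NOT NULL,
        value   TEXT NOT NULL,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (kind, value)
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'old owner'), (2, 'new owner')")
conn.execute("INSERT INTO external_ids VALUES ('phone', '555-0100', 1)")
conn.execute("INSERT INTO external_ids VALUES ('legacy_id', 'A-17', 1)")


def resolve(kind: str, value: str):
    """Translate an external identifier into the internal user_id, if any."""
    row = conn.execute(
        "SELECT user_id FROM external_ids WHERE kind = ? AND value = ?",
        (kind, value),
    ).fetchone()
    return row[0] if row else None


# The phone number is reassigned to a different person.
conn.execute(
    "UPDATE external_ids SET user_id = 2 WHERE kind = 'phone' AND value = '555-0100'"
)

print(resolve("phone", "555-0100"))   # now resolves to the new owner
print(resolve("legacy_id", "A-17"))   # the old account is still reachable
```

The old account keeps its internal ID even though the phone number no longer points at it.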

mkleczek · 2025-12-15T17:21:00 1765819260

> I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The whole discussion is about externally visible identifiers (ie. identifiers visible to external software, potentially used as a persistent long-term reference to your data).

> E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

Introducing surrogate keys (regardless of whether UUIDs or anything else) does not solve any problem in reality. When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me. Surrogate keys don't help here at all. You either have to be able to solve this issue in the database or you need to have an oracle (ie. a person) that must decide ad-hoc what piece of data is identified by the information I provided.

The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".

So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

dpark · 2025-12-15T20:53:08 1765831988

> So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

This is so needlessly complex that you contradicted yourself immediately. You claim there is no “person” identified but immediately say you have information “about a person”. The fact that you can assert that the information is about a person means that you have identified a person.

Clearly tying data to the person makes things so much easier. I feel like attempting to do what you propose is begging to mess up GDPR erasure.

> “So I got a request from a John Doe to erase all data we recorded for them. They identified themselves by mailing address and current phone number. So we deleted all data we recorded for that phone number.”

> “Did you delete data recorded for their previous phone number?”

> “Uh, what?”

The stubborn refusal to create a persistent identifier makes your job harder, not easier.

everforward · 2025-12-15T18:04:36 1765821876

> If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The UUID would be an example of an external key (e.g. to make crawling keys harder). This article mentions a few reasons why you may later decide there are better external keys.

> When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me.

How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key? All 3 of those are pretty routine to change. I've changed my email and phone number a few times, and if I got married my name might change as well.

> Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

I think that spirals into way more complexity than you're thinking. You get those timestamped records about "we got info about person named Y with phone number Z", and then person Y changes their phone number. Now you're going to start getting records from person named Y with phone number A, but it's the same account. You can record "person named Y changed their phone number from Z to A", and now your queries have to be temporal (i.e. know when that person had what phone number). You could back-update all the records to change Z to A, but that breaks some things (e.g. SMS logs will show that you sent a text to a number that you didn't send it to).

Worse yet, neither names nor phone numbers uniquely identify a person, so it's entirely possible to have records saying "person named Y and phone number Z" that refer to different people if a phone number transfers from a John Doe to a different person named John Doe.

I don't doubt you could do it, but I can't imagine it being worth it. I can't imagine a way to do it that doesn't either a) break records by backdating information that wasn't true back then, or b) require repeated/recursive querying that will hammer the DB (e.g. if someone has had 5 phone numbers, how do you get all the numbers they've had without pulling the latest one to find the last change, and then the one before that, and etc). Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".
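To make that last query concrete, here is a small sketch against an in-memory SQLite database. The `phone_number_changes` table and `user_id` column come from the query above; the other columns and all the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE phone_number_changes (
        user_id    INTEGER NOT NULL,
        changed_at TEXT    NOT NULL,
        old_number TEXT,
        new_number TEXT    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO phone_number_changes VALUES (?, ?, ?, ?)",
    [
        (42, "2020-01-05", None, "555-0001"),
        (42, "2021-06-11", "555-0001", "555-0002"),
        (42, "2023-03-30", "555-0002", "555-0003"),
        # The recycled number later shows up on a different account.
        (7, "2022-02-02", None, "555-0001"),
    ],
)


def numbers_for(user_id: int) -> list:
    """Every number an account has had, oldest first, in a single query."""
    cur = conn.execute(
        "SELECT new_number FROM phone_number_changes "
        "WHERE user_id = ? ORDER BY changed_at",
        (user_id,),
    )
    return [row[0] for row in cur]


history = numbers_for(42)
print(history)
```

Note that the recycled number appears under both accounts without any ambiguity, and no historical row had to be rewritten.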

mkleczek · 2025-12-15T19:04:05 1765825445

> The UUID would be an example of an external key (for e.g. preventing crawling keys being easy). This article mentions a few reasons why you may later decide there are better external keys.

So we are talking about "external" keys (ie. visible outside the database). We are back to square one: externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).

It does not matter if they are random or not.

> How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key?

And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.

If you keep information about the time information was captured, you can at least ask me "what was your phone number last time we've interacted and when was it?"

> I think that spirals into way more complexity than you're thinking.

This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.

DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.

Event sourcing is a somewhat convoluted way to attack this problem as well.

> Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".

Sure, but those queries are useless if you just don't know user_id.

everforward · 2025-12-16T17:33:09 1765906389

> It does not matter if they are random or not.

Again, sometimes it does, the article lists a few of them. Making it harder to scrape, unifying across databases that share a keyspace, etc.

> And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.

That surrogate key is linked to literally every other record in the database I have for you. There are near infinite ways for me to convert something you know to that surrogate key. Give me a transaction ID, give me a phone number/email and the rough date you signed up, hell give me your IP address and I can probably work back to a user ID from auth logs.

The point isn't that you know the surrogate key, it's that _everything_ is linked to that surrogate key so if you can give me literally any info you know I can work back to the internal ID.

> This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.

Okay, then let's do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.

Mine is just something like:

SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;

Could be a single query, but you get the idea.
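A minimal sketch of the single-query version, run against an in-memory SQLite database. The table and column names come from the queries above; the sample rows and the `t_all` alias are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id     INTEGER PRIMARY KEY,
        signup_date TEXT NOT NULL
    );
    CREATE TABLE transactions (
        transaction_id   INTEGER PRIMARY KEY,
        user_id          INTEGER NOT NULL REFERENCES users(user_id),
        transaction_date TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, '2024-01-10'), (2, '2024-02-01');
    INSERT INTO transactions VALUES
        (100, 1, '2024-01-15'),
        (101, 2, '2024-02-03'),
        (102, 1, '2024-03-15');
""")

# t      : the one transaction the user told us about
# u      : the account it belongs to (via the surrogate key)
# t_all  : every transaction on that same account
QUERY = """
    SELECT u.signup_date, MIN(t_all.transaction_date) AS first_billed
    FROM transactions AS t
    JOIN users        AS u     ON u.user_id = t.user_id
    JOIN transactions AS t_all ON t_all.user_id = t.user_id
    WHERE t.transaction_id = ?
    GROUP BY u.user_id, u.signup_date
"""

result = conn.execute(QUERY, (102,)).fetchone()
print(result)
```

Everything hangs off `user_id`, so the user never needs to know it; any transaction ID is enough to get there.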

> DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.

This kind of proves my point. If you need bi-temporal extensions and materialized views to tell a user what their email address is from a transaction ID, I cannot imagine the absolute mountain of SQL it takes to do something more complicated like calculating revenue per user.

dpark · 2025-12-15T21:09:16 1765832956

> externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).

All IDs are detached from the real world. That’s the core premise of an ID. It’s a bit of information that is unique to someone or something, but it is not that person or thing.

Your phone number is a random number that the phone company points to your phone. Your house has a street name and number that someone decided to assign to it. Your email is an arbitrary label that is used to route mail to some server. Your social security number is some arbitrary id the government assigned you. Even your name is an arbitrary label that your parents assigned to you.

Fundamentally your notion that there is some “real world” identifier is not true. No identifiers are real. They are all abstractions and the question is not whether the “real” identifier is better than a “fake” one, but whether an existing identifier is better than one you create for your system.

I would argue that in most cases, creating your own ID is going to save you headaches in the long term. If you bake SSN or Email or Phone Number throughout your system, you will make it a pain for yourself when inevitably someone needs to change their ID and you have cascading updates needed throughout your entire system.

halffullbrain · 2025-12-15T20:02:51 1765828971

In my country, citizens have an "ID" (a UUID, which most people don't know the value of!) and a social security number, which they do know (and which has all the problems described above). While the social security number may indeed change (doubly assigned numbers, gender reassignment, etc.), the ID needn't change, since it's the same physical person.

Public sector IT systems may use the ID and rely on it not changing.

Private sector IT systems can't look up people by their ID, but only use the social security number for comparisons and lookups, e.g. for wiping records in GDPR "right to be forgotten" situations. Social security numbers are sort of useful for that purpose because they are printed on passports, driver's licenses and the like. And they are a problem w.r.t. identity theft, and shouldn't ever be used as an authenticator (we have better methods for that). The person ID isn't useful for identity theft, since it's only used between authorized contexts (disregarding Byzantine scenarios with rogue public-sector actors!). You can't social engineer your way to personal data using that ID (save a few movie-plot scenarios).

So what is internal in this case? The person ID is indeed internal to the public sector's IT systems, and useful for tracking information between agencies. It's not useful for Bob or Alice. (It IS useful for Eve, or other malicious inside actors, but that's a different story, which realistically does require a much higher level of digital maturity across the entire society)

brettgriffin · 2025-12-15T18:36:31 1765823791

> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.

The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.

> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

There isn't 'another' - there's just one. The surrogate key. The other pieces of information you're describing are not the means of indexing the data. They are the pieces of data you wish to retrieve.

mkleczek · 2025-12-15T19:23:27 1765826607

Any piece of information that can be used to retrieve something using this index has to be available "outside" your database - ie. to issue a query "give me piece of information identified by X" you have to know X first. If X is only available in your index then you must have another index to retrieve X based on some externally available piece of information Y. And then X becomes useless as an identifier - it just adds a level of indirection that does not solve any information retrieval problem.

That's my whole point: either X becomes a "real world artifact" or it is useless as identifier.

brettgriffin · 2025-12-16T14:35:09 1765895709

That's not really how data is requested. Most of these identifiers are foreign keys - they exist in a larger object graph. Most systems of records are too large for people to associate surrogate keys to anything meaningful - they can easily have hundreds of billions of records.

Rather, users traverse that object graph, narrowing a range of keys of interest.

This hacker news article was given a surrogate key, 46272487. From that, you can determine what it links to, the name/date/author of the submission, comments, etc.

46272487 means absolutely nothing to anybody involved. But if you wanted to see submissions from user pil0u, or submissions on 2025-12-15, or submissions pertaining to UUID, 46272487 would be in that result set. Once 46272487 joins out to all of its other tables, you can populate a list that includes their user name, title, domain, etc.

Do not encode identifying information in unique identifiers! The entire world of software is built on surrogate keys and they work wonderfully.

PunchyHamster · 2025-12-15T23:23:29 1765841009

An identifier is just "a common token that systems can use to operate on the same entity".

You need it, because it may be the one lone unchangeable thing. Taking a person, for example:

* date of birth can be changed, if there was an error and a correction in documents
* nearly all existing physical characteristics can change over time, either due to brain things (deciding to change gender), aging, or accidents (fingerprints no longer apply if you burnt your skin enough)
* DNA might be good enough, but that's one fucking long identifier to share and one hard to validate in the field.

So a unique ID attached to a few other parts to identify the current iteration of an individual is the best we have, and the best we will get.

vrighter · 2025-12-15T13:54:25 1765806865

You can't take into account the fact that things change when you don't know what those changes might be. You might end up needing to either rebuild a new database, have some painful migration, or support two codepaths to work with both types of keys.

mkleczek · 2025-12-15T16:57:04 1765817824

Network protocol designers know better and by default embed protocol version number in message format spec.

I guess you can assign 3-4 bits for identifier version number as well.

And yes - for long living data dealing with compatibility issues is inevitable so you have to take that into account from the very beginning.
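As a sketch of what reserving a few bits could look like: the snippet below packs a 4-bit scheme version into the top of a 64-bit identifier. The 4/60 bit split and the `pack`/`unpack` helpers are assumptions made up for illustration. Standard UUIDs already do something similar, which Python exposes as `uuid.UUID.version`.

```python
import uuid

VERSION_BITS = 4    # hypothetical: room for 16 identifier schemes
PAYLOAD_BITS = 60   # hypothetical: remainder of a 64-bit identifier


def pack(version: int, payload: int) -> int:
    """Put the scheme version in the top bits, the payload below it."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in the reserved bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit in the remaining bits")
    return (version << PAYLOAD_BITS) | payload


def unpack(identifier: int) -> tuple:
    """Split an identifier back into (version, payload)."""
    return identifier >> PAYLOAD_BITS, identifier & ((1 << PAYLOAD_BITS) - 1)


ident = pack(2, 12345)
print(unpack(ident))

# UUIDs carry their own version field in the same spirit.
print(uuid.uuid4().version)
```

Readers of the identifier then branch on the version before interpreting the rest, the same way a protocol parser would.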

vrighter · 2025-12-16T06:11:19 1765865479

When I designed network protocols this is exactly what I did. I also did so in file formats I had to create. But a database primary key is not somewhere where that can be easily done.

groundzeros2015 · 2025-12-15T15:59:28 1765814368

You can’t design something by trying to anticipate all future changes. things will change and break.

In my personal design sense, I have found keeping away from generality actually helps my code last longer (based on more concrete ideas) and makes it easier to change when those days come.

dpark · 2025-12-15T17:26:31 1765819591

In my experience, virtually every time I bake concrete data into identifiers I end up regretting it. This isn’t a case of trying to predict all possible future changes. It’s a case of trying to not repeat the exact same mistake again.

groundzeros2015 · 2025-12-15T20:33:01 1765830781

I don’t disagree with that, I’m disagreeing with this comment that we can’t make protocol or data decisions that might change.

dpark · 2025-12-15T20:56:55 1765832215

I misunderstood then. I interpreted your comment to say that you eschew generalization (e.g. uuids) in favor of concrete data (e.g. names, email addresses) for ids in your designs.

hyperpape · 2025-12-15T12:37:23 1765802243

Your comment is sufficiently generic that it’s impossible to tell what specific part of the article you’re agreeing with, disagreeing with, or expanding upon.

vintermann · 2025-12-15T12:48:02 1765802882

I disagree that performance should be a reason to choose running numbers over guids until you absolutely have to.

I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.

Isn't that clear enough?

mcny · 2025-12-15T13:22:38 1765804958

That's the creation date of that guid though. It doesn't say anything about the entity in question. For example, you might be born in 1987 and yet only get a social security number in 2007 for whatever reason.

So, the fact that there is a date in the uuidv7 does not extend any meaning or significance to the record outside of the database. To infer such a relationship where none exists is the error.

vintermann · 2025-12-15T13:54:01 1765806841

You can argue that, but then what is its purpose? Why should anyone care about the creation date of a by-design completely arbitrary thing?

I bet people will extract that date and use it, and it's hard to imagine use which wouldn't be abuse. To take the example of a PN/SSN and the usual gender bit: do you really want anyone to be able to tell that you got a new ID at that time? What could you suspect if a person born in 1987 got a new PN/SSN around 2022?

Leaks like that, bypassing whatever access control you have in your database, is just one reason to use real random IDs. But it's even a pretty good one in itself.
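To show how little effort the extraction takes, here is a sketch that builds a UUIDv7-shaped value by hand (48-bit millisecond timestamp, version nibble, random bits, following the RFC 9562 layout) and then reads the timestamp back. `make_uuid7` and `created_at` are hypothetical helpers; the stdlib only gained its own `uuid.uuid7()` in recent Python versions, so this does not rely on it.

```python
import os
import uuid
from datetime import datetime, timezone


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Assemble a UUIDv7-shaped value from a millisecond timestamp."""
    rand = int.from_bytes(os.urandom(10), "big")    # 80 random bits
    rand_a = rand >> 68                             # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                 # low 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80       # 48-bit timestamp
    value |= 0x7 << 76                              # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                             # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


def created_at(u: uuid.UUID) -> datetime:
    """Anyone holding the ID can do this; no database access is needed."""
    return datetime.fromtimestamp((u.int >> 80) / 1000, tz=timezone.utc)


u = make_uuid7(1_765_806_841_000)
print(u, created_at(u).isoformat())
```

The timestamp sits in the first 12 hex digits, so it survives being copied into URLs, logs, and exports.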

mcny · 2025-12-15T14:02:18 1765807338

> What could you suspect if a person born in 1987 got a new PN/SSN around 2022?

Thank you for spelling it out for me. For the readers: it leaks information that the person is likely not a natural-born citizen. The assumption doesn't have to be a hundred percent accurate; there is a way to make that assumption and possibly hold it against you.

And there are probably a million ways that a record's creation date could be held against you. If they don't put it in writing, how will you prove they discriminated against you?

Thinking... I don't have a good answer to this. If data exists, people will extract meaning from it whether rightly or not.

infogulch · 2025-12-15T14:42:04 1765809724

To quote the great Mr Sparrow:

> The only rules that really matter are these: what a man can do and what a man can't do.

When evaluating security matters, it's better to strip off the moral valence entirely ("rightly") and only consider what is possible given the data available.

Another potential concerning implication besides citizenship status: a person changed their id when put in a witness protection program.

majorchord · 2025-12-15T15:01:15 1765810875

> You can argue that, but then what is its purpose? Why should anyone care about the creation date of a by-design completely arbitrary thing?

Pretty sure sorting and filtering them by date/time range in a database is the purpose.

miroljub · 2025-12-15T15:24:40 1765812280

If you need sorting and filtering by date, just add a timestamp to your table instead of misusing an Id column for that.

mixmastamyk · 2025-12-15T16:36:12 1765816572

That happens, in general. The benefit comes when it’s time to look up by uuid only; the prefix is an index to its disk block location.

dpark · 2025-12-15T18:44:34 1765824274

> the prefix is an index to its disk block location

What? This is definitely not the case and can’t be because B-tree nodes change while UUIDs do not.

mixmastamyk · 2025-12-15T18:45:48 1765824348

I didn’t mean that literally, but the comment is no longer editable. It was supposed to have “like” etc. in there.

dpark · 2025-12-15T19:00:10 1765825210

But UUIDv7 doesn’t change that at all. It doesn’t matter what flavor of UUID you choose. The ID is always “like” an index to a block in that you traverse the tree to find the node. What UUIDv7 does is improve some performance characteristics when creating new entries and potentially for caching.
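A toy model of the insert behaviour, using a sorted Python list as a stand-in for the index. The keys are plain 128-bit integers shaped loosely like the two UUID flavours (version and variant bits are ignored), and `tail_insert_fraction` is a made-up helper, so treat this as an illustration of key locality rather than a B-tree benchmark.

```python
import bisect
import random

rng = random.Random(0)  # seeded so the run is repeatable


def time_prefixed_key(unix_ms: int) -> int:
    """UUIDv7-like: millisecond timestamp in the high bits, random below."""
    return (unix_ms << 80) | rng.getrandbits(80)


def random_key() -> int:
    """UUIDv4-like: all 128 bits random."""
    return rng.getrandbits(128)


def tail_insert_fraction(keys) -> float:
    """Fraction of inserts that land at the right-hand end of the sorted order."""
    ordered = []
    tail_inserts = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos == len(ordered):
            tail_inserts += 1
        ordered.insert(pos, key)
    return tail_inserts / len(keys)


start_ms = 1_765_000_000_000
v7_like = [time_prefixed_key(start_ms + i) for i in range(1000)]
v4_like = [random_key() for _ in range(1000)]

print("time-prefixed:", tail_insert_fraction(v7_like))
print("fully random: ", tail_insert_fraction(v4_like))
```

Time-prefixed keys keep appending at one edge of the order, while random keys land all over it. Lookups traverse the tree the same way in both cases.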

majorchord · 2025-12-15T16:38:14 1765816694

> just

It is easy to have strong opinions about things you are sheltered from the consequences of.

naasking · 2025-12-15T16:26:39 1765815999

Exactly, be explicit, don't shoehorn multiple purposes into a single column that's supposed to be a largely meaningless unique identifier.

dpark · 2025-12-15T18:42:22 1765824142

That is absolutely not the purpose. The specific purpose of uuidv7 is to optimize for B-Tree characteristics, not so you can craft queries based on the IDs being sequential.

This assumption that you can query across IDs is exactly what is being cautioned against. As soon as you do that, you are taking a dependency on an implementation detail. The contract is that you get a UUID, not that you get 48 bits of timestamp. There are 8 different UUID types and even v7 has more than one variant.

kentm · 2025-12-16T06:31:00 1765866660

B-trees too but also bucketing for formats like delta lake or iceberg, where having ids that cluster will reduce the number of files you need to update.

kentm · 2025-12-16T06:29:24 1765866564

> You can argue that, but then what is its purpose?

The purpose is to reduce randomness while still preserving probability of uniqueness. UUIDv4s come with performance issues when used to bucket data for updates, such as when they're used as primary keys in a database.

A database like MySQL or PostgreSQL has sequential ids and you’d use those instead, but if you’re writing something like iceberg tables using Trino/Spark/etc then being able to generate unique ids (without using a data store) that tend to be clustered together is useful.

anamexis · 2025-12-15T14:08:28 1765807708

I would argue that is one of very few situations where leaking the time at which the ID was created, when you already have the ID, is a possible concern at all.

And when working with very large datasets, there are very significant downsides to large, completely random IDs (which is of course what the OP is about).

kube-system · 2025-12-15T15:57:13 1765814233

The time component either has meaning and it should be in its own column, or it doesn't have meaning and it is unnecessary and shouldn't be there at all.

I'm not a normalization fanatic, but we're only talking about 1NF here.

hyperpape · 2025-12-15T13:59:47 1765807187

Those are two unrelated points and the connection between them was unclear in the original post.

hxtk · 2025-12-15T18:37:47 1765823867

When I think "premature optimization," I think of things like making a tradeoff in favor of performance without justification. It could be a sacrifice of readability by writing uglier but more optimized code that's difficult to understand, or spending time researching the optimal write pattern for a database that I could spend developing other things.

I don't think I should ignore what I already know and intentionally pessimize the first draft in the name of avoiding premature optimization.

barrkel · 2025-12-15T13:27:34 1765805254

UUID v7 doesn't squeeze creation date in. If you treat it as anything other than a random sequence in your applications, you're just wrong.

zamadatix · 2025-12-15T14:03:20 1765807400

"What it does" and "what I think you should do with it" should not be treated as equivalent statements.

anamexis · 2025-12-15T13:47:30 1765806450

For what it’s worth, it was also completely unclear to me how you were responding to the article itself. It does not discuss natural keys at all.

GuB-42 · 2025-12-15T18:49:13 1765824553

I don't think the timestamped UUIDs are "carrying data", it is just a heuristic to improve lookup performance. If the timestamp is wrong, it will just run as slow as the non-timestamped UUID.

If you take the gender example, for 99% of people, it is male/female and it won't change, and you can use that for load balancing. But if later, you found out that the gender is not the one you expect for that bucket, no big deal, it will cause a branch misprediction, but instead of happening 50% of the time when you use a random value, it will only happen 1% of the time: a significant speedup with no loss in functionality.

delecti · 2025-12-15T21:00:55 1765832455

As soon as you encode imperfect data in an immutable key, you always have to check when you retrieve it. If that piece of data isn't absolutely 100% guaranteed to be perfect, then you have to query both halves of the load balanced DB anyway.

moralestapia · 2025-12-15T19:07:10 1765825630

>and you can use that for load balancing

As long as you're not in China or India around specific years ...

GP's point stands strong.

benterix · 2025-12-15T16:17:06 1765815426

Your comment is valid but is not related to the article.

spoiler · 2025-12-15T16:46:56 1765817216

More broadly, this is the age-old surrogate vs natural key discussion, but yes, the comment completely misses the point of the article. I can only assume they didn't read it in full!

vintermann · 2025-12-15T16:55:31 1765817731

The article explicitly argues against the use of GUIDs as primary keys, and I'm arguing for it.

A running number also carries data. Before you know it, someone's relying on the ordering or counting on there not being gaps - or counting the gaps to figure out something they shouldn't.

michaelt · 2025-12-15T21:05:03 1765832703

> A running number also carries data. Before you know it, someone's relying on the ordering or counting on there not being gaps - or counting the gaps to figure out something they shouldn't.

For example, if https://github.com/pytorch/pytorch/issues/111111 can be seen but https://github.com/pytorch/pytorch/issues/111110 can't, someone might infer the existence of a hidden issue relating to a critical security problem.

Whereas if the URL was instead https://github.com/pytorch/pytorch/issues/761500e0-0070-4c0d... that risk would be avoided.

benterix · 2025-12-15T17:23:58 1765819438

> The article explicitly argues against the use of GUIDs as primary keys, and I'm arguing for it.

Let's clarify things.

The author argues against UUIDv4 as primary keys when compared to integers or bigints in large databases for performance reasons.

The examples you give refer to the common mistake of using a non-unique attribute that can be changed for a given entity as a primary key.

konart · 2025-12-15T18:22:29 1765822949

>Before you know it, someone's relying

Do not expose your internal IDs. As simple as that.

dpark · 2025-12-15T18:35:27 1765823727

This came up in the last two threads I read about uuidv7.

This is simply not a meaningful statement. Any ID you expose externally is also an internal ID. Any ID you do not expose is internal-only.

If you expose data in a repeatable way, you still have to choose what IDs to expose, whether that’s the primary key or a secondary key. (In some cases you can avoid exposing keys at all, but those are narrow cases.)

konart · 2025-12-15T19:04:08 1765825448

You have one ID as a primary key. It is used for building relations in your database.

The second ID has nothing to do with internal structure of your data. It is just another field.

You can change your structure however you want (or type of your "internal" IDs) and you don't have to worry about an external consumer. They still get their artificial ID.

dpark · 2025-12-15T19:16:21 1765826181

So what you meant is not to expose the primary key?

That’s a more reasonable statement but I still don’t agree. This feels like one of those “best practices” that people apply without thinking and create pointless complexity.

Don’t expose your primary key if there is a reason to separate your primary key from the externally-exposed key. If your primary key is the form that you want to expose, then you should just expose the primary key. e.g. If your primary key is a UUID, and you create a separate UUID just to expose publicly, you have most likely added useless complexity to your system.

whynotminot · 2025-12-15T23:31:53 1765841513

> create pointless complexity

My exact thought.

A lot else has failed in your system, from access control to API design, if this becomes a problem. Security by obscurity isn’t the answer.

If the only thing between an attacker and your DB is that they can’t guess the IDs you’re already in some serious trouble.

danudey · 2025-12-15T23:23:55 1765841035

Perhaps you can clarify something for me, because I think I'm missing it.

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits

So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the XXX represents e.g. an automatically incrementing number of some kind?

Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.

So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:

1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year

2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal

If it was just DDMMXXXXX (no year) then I can see how this system would fall apart rapidly, but when you're dealing with specifically "people born on Jan 1 2014 or who immigrated to Norway and didn't know their birthday and were born on/around 2014 so that was the year chosen" I'm not sure how that becomes a sufficiently large number to cause these issues. Perhaps this only occurs in specific years where huge numbers of poorly-documented refugees are accepted?

(Happy to be educated, as I must be missing something here)

oncallthrow · 2025-12-15T12:15:45 1765800945

It sounds to me like you’re just arguing for premature optimization of another kind (specifically, prematurely changing your entire architecture for edge cases that probably won’t ever happen to you).

vintermann · 2025-12-15T12:43:54 1765802634

If you have an architecture already, obviously it's hard to change and you may want to postpone it until those edge cases which probably won't ever happen to you, happen. But for new architectures, value your own grey hairs over small performance improvements.

cycomanic · 2025-12-15T18:52:20 1765824740

Like the other poster said, this is a problem with default values, not with encoding the birthday into the personnummer.

I think it is also important to remember the purpose of specific numbers. For instance, I would argue a PN without the birthday would be strictly worse. With the current system (I only know the Swedish one, but assume it's the same) I only have to remember a 4-digit number (because the full number is the birth date plus 4 unique digits). If we instead used completely random numbers, I would have to remember at least an 8-digit number (and to be future-proof you'd likely want at least 9 digits). Sure, that's fine for myself (although I suspect some people already struggle with it), but then I also have to remember the numbers for my 2 kids and my partner, and things quickly become annoying. Especially because one doesn't use the numbers often enough for them to become easy, but still often enough that looking them up is annoying, especially when one doesn't always carry their phone with them.

guhcampos · 2025-12-15T19:16:05 1765826165

It's not that bad. Brazilian CPFs are 11 digits and everyone remembers them. You just get used to it =)

PunchyHamster · 2025-12-15T23:16:51 1765840611

The cause is more just "not having enough bits". A UUID is 128 bits. You're not running out even if you use part of it for a timestamp; the random part will be big enough.

Like, it's a valid complaint... just not for the discussion at hand.

Also, we do live in reality, and while an entirely random one might be perfect from a data-theory standpoint, in reality having it prefixed by a date has many performance advantages.

> Permanent identifiers should not carry data. This is like the cardinal sin of data management

As long as you don't use the data and have actual fields for what's also encoded in UUID, there is absolutely nothing wrong with it, provided there is enough of the random part to get around artifacts in real life data.
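To make the "timestamp part plus random part" split concrete, here is a minimal sketch that packs a UUIDv7-style value by hand, assuming the RFC 9562 layout without the optional counter (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). `make_uuid7` and `timestamp_ms` are hypothetical helper names, not library functions.

```python
import os
import time
import uuid
from typing import Optional


def make_uuid7(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Hypothetical helper: pack a UUIDv7-style value (assumed RFC 9562 layout)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are kept
    rand_a = rand >> 68                  # top 12 bits
    rand_b = rand & ((1 << 62) - 1)      # low 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80  # 48-bit timestamp in the top bits
    value |= 0x7 << 76                   # 4-bit version field
    value |= rand_a << 64                # 12 random bits
    value |= 0b10 << 62                  # 2-bit variant field
    value |= rand_b                      # 62 random bits
    return uuid.UUID(int=value)


def timestamp_ms(u: uuid.UUID) -> int:
    """Read the embedded millisecond timestamp back out of the top 48 bits."""
    return u.int >> 80
```

Under that layout 74 of the 128 bits stay random, and values created in different milliseconds sort by creation time because the timestamp occupies the most significant bits. It also shows the flip side raised upthread: anyone holding the ID can read the creation time back out.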

sgarland · 2025-12-15T14:04:33 1765807473

> Permanent identifiers should not carry data.

Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

> A prime example of premature optimization.

Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.

> Don't make decisions you will regret just to shave off a couple of milliseconds!

A bad PK in some databases (InnoDB engine, SQL Server if clustered) can cause query times to go from sub-msec to tens of msec quite easily, especially with cloud solutions where storage isn’t node-local. I don’t just mean a UUID; a BIGINT PK on a 1:M can destroy your latency for the simple reason of needing to fetch a separate page for every record. If instead the PK is a composite of (<linked_id>, id) - e.g. (user_id, id) - where id is a monotonic integer, you’ll have WAY better data locality.
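The locality argument can be illustrated with a toy model rather than a real engine. The sketch below assumes a clustered index simply stores rows in key order, 100 rows per page (an arbitrary illustrative figure), and counts how many distinct pages hold one user's rows under each key choice. All names and sizes are made up for the illustration.

```python
import random

ROWS_PER_PAGE = 100  # assumed page capacity, purely illustrative


def pages_touched(sorted_keys, wanted):
    """Count distinct pages holding the wanted rows when rows are stored in key order."""
    position = {key: i for i, key in enumerate(sorted_keys)}
    return len({position[key] // ROWS_PER_PAGE for key in wanted})


rng = random.Random(0)
# 100,000 child rows, each owned by one of 1,000 users, in arrival order.
rows = [(row_id, rng.randrange(1000)) for row_id in range(100_000)]  # (id, user_id)
mine = [(row_id, user_id) for row_id, user_id in rows if user_id == 42]

# Clustered on id alone: one user's rows are scattered across the table.
by_id = sorted(row_id for row_id, _ in rows)
pages_id = pages_touched(by_id, [row_id for row_id, _ in mine])

# Clustered on (user_id, id): one user's rows sit next to each other.
by_user = sorted((user_id, row_id) for row_id, user_id in rows)
pages_composite = pages_touched(by_user, [(user_id, row_id) for row_id, user_id in mine])

print(len(mine), pages_id, pages_composite)
```

In this model the id-only layout touches close to one page per row, while the composite layout touches only the handful of pages the contiguous run spans. Real engines add caching, fill factors, and secondary indexes, so treat the numbers as a rough picture only.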

Postgres suffers a different but similar problem with its visibility map lookups.

liuliu · 2025-12-15T19:49:56 1765828196

I read it (and regret it; it was a waste of my time). Their arguments are:

* integer keys are faster;

* uuidv7 keys are faster;

* if you want obfuscated keys, use integers and do your own obfuscation (!!!).

I can get on board with uuidv7 (with the trade-off, of course, on stronger guessability). The integer keys argument is strange. At that point, you need to come up with a custom-built system to avoid ID collisions in a distributed system, all to achieve only a 2x saving (the absolute minimum you should use is 64-bit keys). A very puzzling suggestion and, to me, very wrong.

Note that in this entire article, the recommendation is not about using natural keys (email address, some composite of user identification etc.), so I am skipping that whole discussion.

sgarland · 2025-12-16T00:04:19 1765843459

You can hand out chunks of sequential ids from a central coordinator to avoid collision; this is a well-established pattern.
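A minimal in-process sketch of that pattern, assuming a single coordinator that leases fixed-size blocks and workers that only call back when their block is used up. `BlockCoordinator` and `Worker` are hypothetical names; in a real deployment the coordinator would be a separate service or a database sequence rather than a Python object behind a lock.

```python
import threading


class BlockCoordinator:
    """Hypothetical central coordinator that leases disjoint ID ranges."""

    def __init__(self, block_size: int = 1000, start: int = 1):
        self._next = start
        self._block_size = block_size
        self._lock = threading.Lock()

    def lease(self) -> range:
        # Only this small critical section is shared between workers.
        with self._lock:
            first = self._next
            self._next += self._block_size
        return range(first, first + self._block_size)


class Worker:
    """Hands out IDs locally and contacts the coordinator only when a block runs out."""

    def __init__(self, coordinator: BlockCoordinator):
        self._coordinator = coordinator
        self._block = iter(())  # empty until the first lease

    def next_id(self) -> int:
        for value in self._block:
            return value
        self._block = iter(self._coordinator.lease())
        return next(self._block)
```

The trade-offs are visible even here: IDs are unique but no longer globally ordered by creation time, and a worker that stops mid-block leaves a gap.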

Re: natural keys (or something like it), I was using it as an example of how badly PK choice can impact performance at scale.

liuliu · 2025-12-16T17:47:10 1765907230

> You can hand out chunks of sequential ids from a central coordinator to avoid collision; this is a well-established pattern.

The problem is: is that part of postgresql? If not, someone has to write the buggy code for that well-established pattern. (BTW, I honestly think autoincrement is fine and the choice of PK is so minor you can always pay your way to solve it if you really have a problem at scale).

PunchyHamster · 2025-12-15T23:26:45 1765841205

> (with the trade-off, of course, on stronger guessability).

You're not guessing a 72-bit random number. And if guessing a UUID does something in your app, you already fucked up.

lukeschlather · 2025-12-15T16:40:28 1765816828

> Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

I am not a cryptographer, but I would want his recommendation reviewed by a cryptographer. And then I would have to implement it. UUIDs have been extensively reviewed by cryptographers, I have a variety of excellent implementations I can use, and I know they solve the problem well. I know they can cause performance issues; they're a security feature that is easy to implement, and I can deal with the performance issues if and when they crop up. (Which, in my experience, is unusual. Even at a large company, most databases I encounter do not have enough data for it to matter. I will err on the side of security until it becomes a problem, which is a good problem to have.)

alerighi · 2025-12-15T17:06:52 1765818412

Why are they a security feature? They are not; the article even says so. Even if UUIDv4s are random, nobody guarantees that they are generated with a cryptographically secure random number generator, and in fact most implementations don't!

In a lot of contexts, the reason to use a UUID is that you have a distributed system where you want the client to decide the ID, which is then stored in multiple systems that do not communicate. This is surely a valid scenario for random UUIDs.

To me the rule is: use a UUID as the customer-facing ID for things that have to have an identity (e.g. a user, an order, etc.) and expose it publicly through APIs; use an integer ID as the internal identifier for creating relations between entities, and always keep internal IDs private. That way the more efficient numeric IDs remain inside the database and are used for joining data. The UUID is used only for accessing the object from an API (for example), but internally, when joining (where you have to deal with a lot of rows), you can use the more efficient numeric ID.
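A minimal sketch of that split using Python's built-in `sqlite3`, assuming a hypothetical `users`/`orders` schema. The table, column, and function names are all made up for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id        INTEGER PRIMARY KEY,      -- internal key, used for joins only
        public_id TEXT NOT NULL UNIQUE,     -- random UUID handed to API clients
        name      TEXT NOT NULL
    );
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,
        public_id TEXT NOT NULL UNIQUE,
        user_id   INTEGER NOT NULL REFERENCES users(id),
        item      TEXT NOT NULL
    );
""")


def create_user(name: str) -> str:
    public_id = str(uuid.uuid4())
    conn.execute("INSERT INTO users (public_id, name) VALUES (?, ?)", (public_id, name))
    return public_id


def create_order(user_public_id: str, item: str) -> str:
    public_id = str(uuid.uuid4())
    # Resolve the public UUID to the internal integer key at the boundary.
    conn.execute(
        "INSERT INTO orders (public_id, user_id, item) "
        "SELECT ?, id, ? FROM users WHERE public_id = ?",
        (public_id, item, user_public_id),
    )
    return public_id


def orders_for(user_public_id: str) -> list:
    # The join runs on the integer keys; only UUIDs cross the API boundary.
    rows = conn.execute(
        "SELECT o.public_id, o.item FROM orders o "
        "JOIN users u ON u.id = o.user_id WHERE u.public_id = ? ORDER BY o.id",
        (user_public_id,),
    )
    return rows.fetchall()
```

The cost is an extra unique index per table plus one lookup at the API boundary, which is the added complexity dpark objects to upthread.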

By the way, I think the habit of "using UUIDs" came from NoSQL databases, where you surely do use a UUID, but you also don't have to join data. People then transposed a best practice from one scenario to SQL, where it's not really a best practice...

lukeschlather · 2025-12-15T17:42:59 1765820579

If a sequential ID is exposed to the client, the client can trivially use it to determine the number of records and the relative age of any records. UUID solves this, and the use of a cryptographically secure number generator isn't really necessary for it to solve this. The author's scheme might be similarly effective, but I trust UUIDs to work well. There are obviously varying ways to hide this information other than UUIDs, but UUIDs are simple and I don't have to think about it, I just get the security benefits. I don't have to worry about not exposing IDs to the clients, I can do it freely.
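This inference is usually called the German tank problem. A rough sketch of the standard frequentist estimate, assuming IDs start at 1, have no gaps, and the observed ones are a uniform sample without replacement; `estimate_total` is a hypothetical helper name.

```python
def estimate_total(observed_ids):
    """German tank estimate m + m/k - 1 for k distinct serials with maximum m."""
    ids = set(observed_ids)
    k, m = len(ids), max(ids)
    return m + m / k - 1


# Four observed record IDs are enough for a ballpark figure.
print(estimate_total([19, 40, 42, 60]))  # 74.0
```

Real ID sequences have gaps from rollbacks and deletes, so this is an estimate of the highest ID issued rather than an exact row count.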

sgarland · 2025-12-15T20:17:26 1765829846

I have never seen anyone post an actual example of the German Tank problem creating an issue for them, only that it’s possible.

> I don’t have to think about it

And here we have the main problem of most DB issues I deal with on a daily basis - someone didn’t want to think about the implications of what they were doing, and it’s suddenly then my emergency because they have no idea how to address it.

lukeschlather · 2025-12-15T23:32:19 1765841539

If you can predict user IDs this is extremely useful when you're trying to come up with an exploit that might create a privileged user, or perhaps you can create some object you have access to that is owned by users that will be created in the near future.

When I say "I don't have to think about it" I mean I don't have to think about the ways an attacker might be able to predict information about my user ids which they could use to gain access to accounts, because I know they cannot predict information about user ids.

You are dismissing the implications of using something that is less secure than UUIDs and you haven't convinced me I'm the one failing to think through the implications. I know there are performance problems, I know they might require some creative solutions. I am not worried about unpredictable performance issues, I am worried about unpredictable security problems.

sgarland · 2025-12-16T00:01:05 1765843265

Perhaps this is my bias coming through. I work with DBs day in and day out, and the main problem I face is performance from poorly-designed schemas and queries; next largest issue is referential integrity violations causing undefined behavior. The security issues I’ve found were all people doing absurdly basic stuff, like exposing an endpoint that dumped passwords.

To me, if you’re relying on having a matching PK as security, something has already gone wrong. There are ways to provide AuthN and AuthZ other than that. And yes, “defense in depth,” but if your base layer is “we have unguessable user ids,” IME people will become complacent, and break it somewhere else in the stack.

mrkeen · 2025-12-16T05:29:47 1765862987

Here you go:

https://news.ycombinator.com/item?id=46279123

[2025-12-10]

> We generate every valid 7-digit North American phone number, then for every area code, send every number in batches of 40000

> Time to go do something else for a while. Just over 27 hours and one ill-fated attempt at early season ski touring later, the script has finished happily, the logfile is full of entries, and no request has failed or taken longer than 3 seconds. So much for rate limiting. We’ve leaked every Freedom Chat user’s phone number

mzi · 2025-12-15T18:33:52 1765823632

> Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right?

I guess that Norway has solved it in the same or a similar way as Sweden? A person is identified by the PNR, and those systems that need to track a person across several PNRs (government agencies) use a PRI. A PRI is just the first PNR assigned to a person with a 1 inserted in the middle. If that PRI is occupied, use a 2, and so on.

PRI could of course have been a UUID instead.

maxbond · 2025-12-15T23:26:17 1765841177

> Permanent identifiers should not carry data.

Do you have the same criticism for serial identifiers? How about hashes? What about the version field in UUIDs?

scottlamb · 2025-12-15T16:13:29 1765815209

> Permanent identifiers should not carry data.

I think you're attacking a straw man. The article doesn't say "instead of UUIDv4 primary keys, use keys such as birthdays with exposed semantic meaning". On the contrary, they have a section about how to use sequence numbers internally but obfuscated keys externally. (Although I agree with dfox's and formerly_proven's comments [1, 2] that the XOR method they proposed for this is terrible. Reuse of a one-time pad is probably the most basic textbook example of bad cryptography. They referred to the values as "obfuscated" so they probably know this. They should have just gone with a better method instead.)

[1] https://news.ycombinator.com/item?id=46272985

[2] https://news.ycombinator.com/item?id=46273325
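To illustrate why key reuse is the problem, here is a small sketch assuming the scheme amounts to XOR-ing every integer ID with one fixed secret (the article's exact construction is not reproduced here). `SECRET` and the function names are hypothetical.

```python
SECRET = 0x5DEECE66D  # hypothetical fixed key, stands in for whatever a service would use


def obfuscate(internal_id: int) -> int:
    """XOR the internal ID with the same secret every time."""
    return internal_id ^ SECRET


def recover_key(known_internal: int, known_public: int) -> int:
    """One known (internal, public) pair gives the key away outright."""
    return known_internal ^ known_public


def leaked_xor(public_a: int, public_b: int) -> int:
    """XOR-ing two public IDs cancels the key and leaks the XOR of the internal IDs."""
    return public_a ^ public_b
```

So someone who creates two records back to back and sees public IDs that differ only in a low bit can tell the underlying IDs are adjacent, and a single guessed internal ID exposes the whole mapping.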

patmorgan23 · 2025-12-15T17:47:32 1765820852

Insert order or time is information. And if you depend on that information you are going to be really disappointed when back dated records have to be inserted.

scottlamb · 2025-12-15T18:05:20 1765821920

Right, to ensure your clients don't depend on that information, make the key opaque outside the database through methods such as the ones dfox and formerly_proven suggested, as I said.

naasking · 2025-12-15T16:30:22 1765816222

I don't think the objection is that it exposes semantic meaning, but that any meaningful information is contained within the key at all, e.g. even a UUID that includes timestamp information about when it was generated is "bad" in a sense, as it leaks information. Unique identifiers should be opaque and inherently meaningless.

scottlamb · 2025-12-15T17:10:04 1765818604

Your understanding is inconsistent with the examples in vintermann's comment. Using a sequence number as an internal-only surrogate key (deliberately opaqued when sent outside the bounds of the database) is not the same as sticking gender identity, birth date, or any natural properties of a book into a broadly shared identifier.

naasking · 2025-12-15T19:16:26 1765826186

No, it's not; they very explicitly clarify in follow-up comments that unique identifiers should not embed any kind of meaningful content. See:

https://news.ycombinator.com/item?id=46276995

https://news.ycombinator.com/item?id=46273798

scottlamb · 2025-12-15T19:18:31 1765826311

Okay, but they ignore the stuff I was talking about, consistent with my description of this as a straw man attack.

> A running number also carries data. Before you know it, someone's relying on the ordering or counting on there not being gaps - or counting the gaps to figure out something they shouldn't.

The opaquing prevents that.

They also describe this as a "premature optimization". That's half-right: it's an optimization. Having the data to support an optimization, and focusing on optimizing things that are hard to migrate later, is not premature.

asah · 2025-12-15T21:56:31 1765835791

counterpoint: IRL, data values in a system like PostgreSQL are padded to word boundaries so either you're wasting bits or "carrying data."

Traubenfuchs · 2025-12-15T15:21:56 1765812116

> Norwegian PNs have your birth date

Same with Austrian social security numbers, which, in some cases, don't contain the person's birth date and in some cases don't contain any existing date at all.

Yet many websites enforce a valid date and pull the person's birthdate from it...

oblio · 2025-12-15T13:02:30 1765803750

> Well, wrong, since although the date doesn't change.

Someone should have told Julius Caesar and Gregory XIII that :-p