> Clustered primary keys. Without these, some types of queries can become rather...

SigmundA · on Feb 25, 2022

Clustered indexes can be overused but they are sorely missing from PG, having had them forever in SQL server it was another surprise to see they don't exist in PG and there is even a confusing CLUSTER command to reorder the heap table but does not actually make a clustered index.

Clustered index are great for storage space and IO for common write and retrieval patterns. If your table is accessed in one way most of the time and that way needs to be fast and therefore requires an index a clustered index saves write IO (only writing the table no secondary index) disk space (no secondary index redundantly storing the indexed columns) and retrieval time (no indirection when querying the clustered index).

This is great for certain kinds of tables, for instance log tables that need to written to quickly but queried quickly (usually by time range) and grow quite large and are append only. GIS tables which can also get quite large and are by points can be packed really tight and row found quickly. Entity tables that are mostly retrieved via primary key ID during app usage, you trade a slight performance penalty when using secondary index for maximum performance when retrieving by ID which effect everything in including foreign key checks this again saves space on writing the ID twice to disk. Tables that actually are an index for something else such that the table is kept up to date with a trigger etc and usually exist for specific access pattern and performance.

castorp · on Feb 25, 2022

I have been working with Oracle for more than 20 years now. I think I only had very few situations where an "index organized table" (=clustered index) was useful or actually provided a major performance benefit over a "heap table". So I never really miss them in Postgres.

SigmundA · on Feb 25, 2022

It is such a common access pattern that many database engines always have a clustered index (MySql - InnoDB, Sqlite) whether you use them directly or not.

I like having a choice as there is in Sql Server or Oracle, but for many use cases its a waste to write to a heap and to an index (which is just a hidden IOT) then look up in the index and dereference to the heap both in space and time.

electroly · on Feb 25, 2022

> Well, you can’t do that with an equals filter. But how often do you use non-equals filters like > or < on the primary key?

You can; he's wrong. He missed the vastly more common scenario: your primary key is a composite key and you're filtering on a prefix of that composite key. We do that all the time. No < or > required; you get a range of keys with an equijoin. It's additionally common for the first key in the composite key to be some kind of timestamp or monotonically increasing value, so new writes always go together at the end of the table. This sort of key is absolutely begging to be used with a clustered primary key index.

We have a table with 100+ billion rows that uses this pattern. It uses partition switching to append new data. At this size it is absolutely imperative to reduce the number of indexes to keep insert performance high, and we are always querying multiple consecutive rows based on a prefix equijoin. I'd be a fool to follow the author's advice here.

I suspect the author and I live in very different worlds, and he doesn't know my world exists.