Recursive Common Table Expressions in Postgres (citusdata.com)
202 points by jesperht on May 22, 2018 | 97 comments


The only problem I have with them is that it takes my brain 30 minutes to digest the piece of SQL in front of me that uses a WITH RECURSIVE clause.

I do whatever I'm supposed to do. Three months later, when a bugfix is required, I find myself back at this piece of SQL that I know works, but I can't digest it for another 30 minutes.

And my case is really simple.

But yeah they are amazing.
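
(For anyone who hasn't run into one yet, even a simple case has this shape; the "employees" table and columns below are invented for illustration:)

    -- walk an org chart from the root down
    WITH RECURSIVE reports AS (
        -- anchor: start from employees with no manager
        SELECT id, name, 1 AS depth
        FROM employees
        WHERE manager_id IS NULL
      UNION ALL
        -- recursive step: join each employee onto the rows produced so far
        SELECT e.id, e.name, r.depth + 1
        FROM employees e
        JOIN reports r ON e.manager_id = r.id
    )
    SELECT * FROM reports;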


People need to comment SQL more!

We have a few ~200 line long queries in one of our applications. With comments one of them is 460 lines.

When things get hairy, don't be afraid to explain why you needed this CTE, what it's doing from a high level, point out that little gotcha that you ran into a few times while developing the query, why you needed to do X instead of the simpler Y, etc...

While there are downsides to excessive commenting (the big one is that the comments can get "out of sync" with the code and just cause confusion), when it comes to big/complicated SQL, I've found that more is better for the most part.


You can say that again! I started a DB review team at my company. And not only queries: Postgres allows comments on the database objects themselves. You can place a comment on tables, columns, constraints, domains, triggers, views, pretty much anything.
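
(For reference, attaching those comments looks roughly like this; the object names are invented:)

    COMMENT ON TABLE orders IS 'One row per customer order; soft-deleted rows have deleted_at set';
    COMMENT ON COLUMN orders.status IS 'Lifecycle state: pending -> paid -> shipped';
    COMMENT ON VIEW open_orders IS 'Orders not yet shipped; used by the fulfilment dashboard';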

Initially a few folks rolled their eyes at that, but now everyone loves it. When you have thousands of tables, comments help, and likewise when you have a massive number of queries. I'm also in the camp of letting the DB do the work if it can do it faster, and likewise we have 100-400 line queries that would otherwise turn into programs with 5x the LOC that take 20x the time to execute. Writing tons of comments is important. Usually the query is right, and once it works, no one looks at it for a year or two until business rules change.

My approach to commenting such queries after a modification is to delete every single comment and start afresh, from memory, when I'm done. If I can't, then I have no idea what the query is doing, which is very bad. Sometimes folks modify a query and just happen to get a happy result. If I can document it without reference to the old version, great! After that, I look at the old one to make sure I'm not missing anything.


I tend to take the "rubber duck debugging" approach to writing comments for big hairy SQL statements.

After I'm done (I tend not to comment these behemoths until at or near the end), I go and explain what the query is doing, sometimes almost line by line, to a fake "rubber duck" as if it were a junior developer. Sometimes I even comment on what I didn't do, and why I didn't do it. Because as it turns out, future me is a bit of a cocky asshole who always thinks past me was some kind of idiot that must never have thought to try that before...


I can totally relate to the future me being a cocky asshole part. I just had this experience coding the other day. I got about 15 minutes in thinking "this will be way better", git mv'ing files and committing etc until I ran into the exact same roadblock and it all came back to me. Now there is a sheepish trail of commits with messages explaining why I was wrong and it wasn't way better. I could have force pushed the commit before I started the changes but I wanted to shame myself so maybe I'll remember next time...


OP here, couldn't agree more regarding commenting[1]. Far too often people don't treat their SQL queries or database the same way they do code. Giving SQL the same attention as your code, with good formatting and comments, can make life much easier the next time someone else, or even you yourself, has to come back to some query years later.

[1]. http://www.craigkerstiens.com/2013/07/29/documenting-your-po...


> comments can get "out of sync" with the code and just cause confusion

I generally find it a safe assumption that where the comments don't agree with the code, both are wrong, or will be the next time someone makes a related change (as Sod's law says they'll choose to trust the one that is least correct). Even when all the actors involved are me at different times.


I feel like for queries that are hard to understand at a glance, a picture is worth a thousand words. For the majority of people, myself included, writing helpful, high quality comments is really hard.

Instead, for a CTE, I would rather add a sample of initial data, and how they are transformed by the first two or so iterations of the loop.


> We have a few ~200 line long queries in one of our applications. With comments one of them is 460 lines.

You can decompose large queries into smaller views/temp tables, and such queries will be much more readable.
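
(A rough sketch of that decomposition with invented names; staging into a temp table also lets you index the intermediate result:)

    -- stage an intermediate result once instead of nesting it inside a 200-line query
    CREATE TEMP TABLE recent_orders AS
    SELECT customer_id, sum(total) AS spend
    FROM orders
    WHERE ordered_at >= now() - interval '30 days'
    GROUP BY customer_id;

    CREATE INDEX ON recent_orders (customer_id);

    -- build the final query on top of the staged result
    SELECT c.name, r.spend
    FROM customers c
    JOIN recent_orders r ON r.customer_id = c.id;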


It depends on how reused those subcomponents are; sometimes separating portions of the logic into views can obscure what's going on and end up increasing the LoC you need to parse to grok where a bug may be.

It's a trade off.


With most databases, you don't need to reuse it to get the maintainability win with temporary tables. (The notable negative exception being Oracle.) Furthermore it gives you power to control the query plan - a feature that I've found very helpful with both MySQL and PostgreSQL on occasion.


This is my problem with SQL in general. The facilities for which SQL is an API are incredible, but that API has so many weird design choices behind it that it often feels like I'm being forced to play some kind of perverse minigame in order to get the database to do what I want efficiently. And then, once I've managed to bend the query planner to my will, I often have to load it down with an absurd number of comments to explain to future folk what exactly is happening – ostensibly the sort of thing that a declarative language is supposed to prevent.


My advice is to always use an ORM with SQL if you possibly can, or if not an ORM per se, at least an abstraction library like the sqlalchemy "core". It sounds ironic to say but "SQL" is the worst part of SQL. That shouldn't stop you loving SQL though.


It’s fine to use an ORM until you need joins, order by, group by, subqueries, case clauses...

You get the point.


What ORM doesn't handle joins, order by and group by? Most should even support subqueries and case clauses.


Also, using HQL (or equiv) defeats the purpose of using an ORM. Just use SQL.

---

Imagine you're troubleshooting your query and its mapping. It's got some modest joins. So you capture the emitted SQL. Then you fuss with that SQL to make it something reasonable. Once that SQL is working, you then wrestle with the ORM to try to coerce it into re-emitting your desired SQL.

This is also called reverse engineering, pushing rope, fighting the 800lb angry gorilla sitting between you and your work.

Eventually you'll figure out you should just use SQL, whatever dialect your backend supports.


The purpose of an ORM is to map database query results to objects. You still have to write the queries yourself.

I will agree that HQL has flaws, but not the same ones you describe. It is string-based, which means it is cumbersome to use, dynamic column filtering is not possible, entering parameters requires duplicating the variable name three times, and it isn't type-checked at compile time.

But frameworks like Grails (which is built on top of Hibernate and its Criteria API) or LINQ have solved these problems.

My biggest performance bottlenecks are in batch insert/update performance, for which most ORMs already generate optimal queries that don't involve database-specific features like "unnest(col1, col2)". But even that could be solved trivially by exposing a simple batch insert/update API, giving both the convenience of an ORM and the performance of DB-specific optimisations.


Um... how many ORMs do you have experience with? I suggest you look at the mentioned sqlalchemy core's ability to map more or less directly to SQL with reduced syntactic ugliness.


There is a fruitful discussion that can be had on helpful ORMs vs harmful ORMs, and it usually ends up describing a category of marshalling helpers that don't quite form a regular ORM as the ideal situation.

My take-home on the whole thing is, you need something that makes working with relations in your code easy but you don't need something attempting a ridiculous abstraction like "relations are objects."


Just so I understand you:

You're saying you've never captured the generated SQL, debugged and tweaked it, and then backported that SQL to your ORM obfuscation layer?


How would an ORM help with a WITH RECURSIVE clause?


I don't think relinquishing more control is the answer.


Perhaps someone should design a purely functional language that translates to SQL (?)


See Haskell's Esqueleto, Scala's Slick and Quill, C#/F#'s LINQ to SQL + Type Providers, probably others.

In the case of Haskell and Scala the aforementioned query DSLs serve as CTE generators without the performance hit (albeit non-recursive).

Being able to compose complex statements based on statically typed query snippets is really, really nice -- strips out the reams of repeated boilerplate that one inevitably is forced to write with string-y SQL.


Interesting. Could these be extended to support recursion using these new recursive CTEs? Or would they miss some fundamental mechanism, such as the ability to pass parameters down the recursive queries?


> Could these be extended to support recursion using these new recursive CTEs?

Sure, in the end they generate (prepared) SQL statements; CTEs are just another statement with an expected result type.

As for supporting parameters in a recursive CTE, I don't think providing runtime values would be an issue.

    WITH RECURSIVE $name (n) AS (
        SELECT $init
      UNION ALL
        SELECT n + $incr FROM $name WHERE n + $incr <= $limit
    )
    SELECT n FROM $name;
You could use it right now with the built-in string interpolator, but I'm not sure how this would look in DSL form, since recursive CTEs can support both real and temporary tables (i.e. the DSL has to know the structure of the temporary table in order to statically determine the result type of the query).


A side project I have stewing in my head is to try to hack something into a Postgres plugin to provide a nice AST format for producing specific query plans. I think this can be done, given that there are some old plugins that allow stuff like saving query plans, but I haven't looked much into the details yet. My thinking is that once you have that, you open the door for a lot of experimentation into query language alternatives that still maintain the power of SQL. And since you're cutting the query planner out of the equation (and moving it instead into a compilation step), you can hopefully take some of the frustrating black magic out of it.

Key assumptions I'm working from here are:

1. Actual programmers (in contrast with SQL's original target audience) generally know exactly what data access patterns are appropriate for a particular task; they just don't want to write them from scratch and have to worry about stuff like locking and transactions at a fine grained level.

2. Being able to switch plans on the fly based on query input and data statistics is not a huge win in most real world scenarios, and isn't worth the unpredictability that it entails.

3. Most of the shallow pain of working with SQL comes from the fact that it was a strange paradigm to start with, and then had a bunch of features hacked on top of it.

4. Most of the deep pain of working with SQL comes from the fact that it abstracts too much, forcing you to reverse engineer the desired query plan through those abstraction layers. And like a lot of reverse engineering, the result is fragile and might change to something far less efficient in the future for inscrutable reasons like database upgrades or subtle changes in how the data is shaped.

Pretty much any replacement can address 3. Ideas that really excite me address 1 and 4, and it seems to me that those are more likely to look procedural than purely functional. But like I said, the main thing is that having a low level target to compile to would allow us to try out different ideas and see what works.


Speaking as someone who has worked as a PostgreSQL DBA for over a decade, in environments ranging from hundreds of queries per second to hundreds of thousands, I can say with confidence that your assumptions do not map to my lived reality, or to that of any other DBA with whom I've spoken about the challenges of working with developers who treat the database as a black-box dump truck they neither have to understand how to use well nor care about the implications of using poorly.

1. "Actual programmers" tend to have net negative clue about which data access patterns are appropriate. It is, for example, staggeringly common for me to have to explain that a "table scan" is often more efficient than random IO ("index scan") — even on NAND media — if you're reading more than some threshold of the data in a table.

If you (the general "you") understand data access patterns that poorly, no, you absolutely should not be dictating query plans. If you think "hav[ing] to worry about stuff like locking and transactions" is an imposition, you don't want me to sit in your interview. I will hard pass.

2. See above.

3. The relational algebra is not a "strange paradigm". It's very, very simple. "Strange features" like what? Ordering? Aggregation? Set intersection and exclusion?

4. I don't even understand this complaint.


I'm confused by how severely you've misrepresented my points. For example:

> If you think "hav[ing] to worry about stuff like locking and transactions" is an imposition, you don't want me to sit in your interview.

Why did you remove "at a fine grained level" in quoting me? I'm saying that people want the facilities that a database painlessly provides with respect to these things. What I'm saying is that programmers don't want to implement MVCC themselves, for example, or invent their own system for managing locks. These are things that are great about RDBMSes, but my point is that they could be done without SQL.

For another example:

>"Strange features"

That's not a thing I said at all. I said "a bunch of features". "with recursive" is an example of this. It's a great feature, but as the commenter at the top of this thread pointed out, using it is clunky and hard to read. I believe that this is in large part because of how it had to be worked into an existing, weirdly designed language in a backward-compatible way.

Edit to add: I don't understand why discussions about software engineering so often quickly turn into these attacks on people's competence. I might be wrong, and if so, you can convince me of that without raising your hackles with these aggressive statements about what you would or wouldn't do if this were a job interview. That kind of rhetoric is toxic to productive discussions.


I apologize if I've misrepresented your concerns. I'm operating from a perspective of a decade and change of doing this dance over and over:

  Engineer: "SQL is dumb and hard!"
  DBA: "What part?"
  Eng: describes problem
  DBA: describes misunderstanding
  Eng: "Oh! Oh, that's actually really simple! Thanks!"
It pretty much never goes the other way. So I'm probably a little over what read like dismissive, mis-premised, or under-informed criticisms. (And, yes, that probably colored the tone of my response. Again, apologies.)

> I believe that this is in large part because how it had to be worked into an existing, weirdly designed language in a backward compatible way.

Is sloppy shoe-horning the fault of the shoe, or the fault of the horn (assuming, for sake of discussion, that it's even sloppily done)? Recursively traversing parent-child (among other) relationships is not a wild, unforeseeable extension of set theory — and that's really all SQL is: a practical expression of set theory with a syntax that (admittedly) sometimes obscures that fact.

EDIT: Re: your edit. As one of my employer's DBAs, it's part of my job to reduce risk, including by passing on candidates I feel inadequately understand databases, or whose attitude evinces a lack of interest in improving that understanding. It's not about "attacking" a lack of competence, so much as avoiding the kind of incompetence that refuses to recognize itself.

If you've ever sat on the interviewer side of that table, you can't even pretend not to have seen entirely too much of that, and you'd pass on someone who didn't think they needed to understand the costs of various forms of, e.g., list traversal, just as quickly.


My Haskell library beam[1] does just that. It'll catch scoping errors, aggregation errors, and more, through the type system. I'm implementing support for recursive queries right now!

Also, for the record, SQL is a strongly typed, declarative language, even if it doesn't seem that way.

[1] (http://tathougies.github.io/beam/)



I do think the biggest pro for CTEs is the increased readability, and using recursion inhibits the readability a bit (still better than the alternative). But I'm curious how complicated the SQL you're using is that it takes half an hour to understand what it's doing.


OP here, completely agreed. I have a separate post from some years back about CTEs specifically around readability. While they can make things a bit long and can be an optimization fence in Postgres, the building blocks and readability in my eyes tend to win out over those concerns pretty often - http://www.craigkerstiens.com/2013/11/18/best-postgres-featu...


But can you solve the same task in a more straightforward way without a CTE?


Yeah, use a graph database. Most trees end up becoming graphs as changes happen over time and the CTEs won't help much anymore.


Though I would propose that in the vast majority of applications, the amount you're going to need to do graph-like operations is significantly outweighed by the amount you're going to need to do regular relational-type operations. Being able to do them "well enough" without adding yet another moving part to your system (one which holds data that probably needs to be synchronized with other data stores) is extremely valuable.


Don't graph databases have similar query facilities? How does shifting to a graph database avoid the complexity of recursive queries?


The primary facility in graph databases is that each node explicitly contains a list of relationships it has with other nodes. The relationships are treated as first-class citizens. Neo4j has a great set of docs that introduce graph DBs to people already familiar with RDBMSes[1].

[1] https://neo4j.com/developer/graph-db-vs-rdbms/


Another good resource is Designing Data-Intensive Applications [1]. Chapter 2 does a really good job explaining how different categories of databases relate to different data models, including examples of querying graph-like data models using `WITH RECURSIVE` compared to a query language for graph databases.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...


They're similar in complexity, but made easier by the fact that graph databases explicitly exist to query these types of relationships. Where someone may not know how to find their root manager node in an employee table, it would be an ideal thing to query for in a graph database. There are functions like 'shortestPath' in neo4j for example.


Don't worry. It takes the database 30 minutes to execute it too.


There's no particular reason that a recursive CTE should be that slow. Some are, but it's not too hard to write any kind of query to go as slow as you want to make it go.


Fun fact: Recursive CTEs are one of those things that, if you come to them relatively new, will make you go "who would ever want to use this, and why?!"

Funner fact: Once you come across a problem to which the solution is to use a recursive CTE (handling train movement graphs in an RDBMS? yes please!), you'll find yourself having to explain your reasoning for it in a manner that will let you truly grok recursive CTEs.

Funnest fact: Both the above will almost certainly happen more than once in your data-wrangling lifetime.

(Or maybe that was just me...)


My order of discovery was more like this:

1) Recursive CTEs: that's nifty; I'll remember that for when I need it.

2) Here's a problem that involves chasing a linked list of rows, I know, I'll use a recursive CTE.

3) Oh dear, that doesn't perform well at all, I'm much better off denormalizing my lists into a separate column and using LIKE expressions, or something similar instead.


Do some database systems have support for optimizing recursive queries?


The best you can do is slap an index -- at the end of the day, as the parent said, you're traversing a linked list in random order which means that if your database index doesn't fit in RAM you'll be doing a random disk access per recursive step. Slow, but no slower than any other solution with that same problem.

If you want to optimize such a query, go back to the drawing board and see if your data truly warrants a recursive random walk.


It depends on the nature of the data. If it's hierarchical you can use ORDPATH[1], which results in an index scan. You can use ORDPATH to further solve other queries with a bit of creativity (such as this ancestor/descendant querying structure for directed cyclic graphs[2] that we needed).

My spidey senses lead me to believe that you won't get far with optimizing CTEs in the query engine, especially if the backing temp table becomes too large for memory.

[1]: http://www.dbis.informatik.hu-berlin.de/fileadmin/lectures/W...

[2]: https://github.com/k2workflow/Clay/blob/master/src/SourceCod...


We use this kind of CTE for handling BoM trees.

Everyone thinks we have some complicated software that generates this tree. Well, we just maintain the correct parent-child relation. The rest is trivial with a recursive CTE.
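
(A sketch of the kind of query involved; the "bom" table, the root id, and the quantity column are invented for illustration:)

    -- explode the bill of materials for one top-level assembly
    WITH RECURSIVE parts AS (
        SELECT child_id, quantity, 1 AS level
        FROM bom
        WHERE parent_id = 42              -- hypothetical root assembly
      UNION ALL
        SELECT b.child_id, p.quantity * b.quantity, p.level + 1
        FROM bom b
        JOIN parts p ON b.parent_id = p.child_id
    )
    SELECT child_id, sum(quantity) AS total_quantity
    FROM parts
    GROUP BY child_id;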


Very cool, this was actually my first real-world use case for Postgres CTEs (the BoM part of homemade ERP software I built for my last company). Worked great!


>BoM trees

Bill of material trees? Could you maybe elaborate more?


Yes, Bill of Materials. In our business we have sub-sub-...-sub-components, so we need to keep it as a tree, but manage it in an RDBMS.


One common use case I've found for recursive CTEs is generating tables of dates.

Very frequently, I have a question along the lines of "how many x have occurred per day/week/month?", where x has a timestamp column, where the data could be sparse.

One neat way to do this is to create a recursive CTE that begins by selecting the first relevant date, then unions that with the previous date plus one period (day/week/month), up to the last date you're interested in.

Once done, you can left join your date CTE against a COUNT aggregation grouped by the period you care about on your data table.
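
(A sketch of that pattern, assuming a hypothetical "events" table with a "created_at" timestamp:)

    WITH RECURSIVE days AS (
        SELECT date_trunc('day', min(created_at)) AS day FROM events
      UNION ALL
        SELECT day + interval '1 day'
        FROM days
        WHERE day < (SELECT max(created_at) FROM events)
    )
    -- left join so days with no events still show up with a count of 0
    SELECT d.day, count(e.id) AS events_that_day
    FROM days d
    LEFT JOIN events e ON date_trunc('day', e.created_at) = d.day
    GROUP BY d.day
    ORDER BY d.day;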


> One common use case I've found for recursive CTEs is generating tables of dates.

If you don't have table-valued functions, just create a table with all the dates for the next hundred years or so. The number of distinct dates is pretty manageable. This is the approach I use when calculating stats data via MySQL.


And make that table WIIIIIIDE. There are so many useful attributes for query time and to be exposed as filters/parameters in reports for end users in a more self service mode.

Flags for whether the day part is YTD, MTD or similar, so you can easily select the set of days that would be in the YTD period for any year. This is super helpful for a multi-year comparison of YTD. So right now, everything from January 1 to May 21/22 (depending on your edge case) would be YTD=1 for all years (not just 2018).

Contiguous, monotonically increasing indices for any date period - weeks, months, quarters, trimesters, semesters - especially helpful for fiscal calendars where the entirety of built-in date functions become worthless. Last period is always the index of the current period minus one. All date shifting logic becomes trivially expressible as basic arithmetic on these indices.

Pre-format dates into all the common display formats you might use. Just extra strings in the table - super useful for your non-technical report authors and self-service analytics consumers - don't worry about the cognitive overhead of teaching dozens-thousands of people how to use format strings effectively (they won't - they'll just pull it into Excel), when you can trade a tiny bit of storage overhead to give them all they want up front.

Give a simple "IsWorkDay" flag - include all your company holidays and weekends here. Also separately include an "IsWeekDay" flag and an "IsHolidayFlag". If you're concerned about storage, please count the number of dates - you need that many bits to store each of these flags. This is ~4.5KB per flag field on a 100-year date table. Even if you're storing things inefficiently as BIGINTs, it's ~292KB per flag field. This is pathetically small.

I really can't emphasize how useful this sort of thing is for fiscal calendars. A calendar is just a hierarchy whose base unit is a date.

Everyone's dates are the same. They just belong to different categories (Month1=January vs Month1="first four full weeks in the calendar year"). Those categories are defined as contiguous and non-overlapping sets of these base units. We care an awful lot about the inequality invariants of members of these sets (months, quarters, etc). All of this stuff can be exposed through a table that is updated for certain flags, or just a base table of static attributes and a view for the "dynamic" components.
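
(A minimal sketch of such a table; the columns are just examples and only a couple of them are populated below:)

    CREATE TABLE date_dim (
        date_actual   date PRIMARY KEY,
        month_index   int,       -- contiguous, monotonically increasing across years
        fiscal_period int,
        is_ytd        boolean,
        is_work_day   boolean,
        is_week_day   boolean,
        is_holiday    boolean,
        display_long  text       -- e.g. 'Tuesday, May 22, 2018'
    );

    -- populate a century of rows in one statement
    INSERT INTO date_dim (date_actual, is_week_day)
    SELECT d::date, extract(isodow FROM d) < 6
    FROM generate_series('2000-01-01'::timestamp, '2099-12-31'::timestamp, interval '1 day') AS d;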


I'd second this for dates. Every data warehouse I've been involved in building has always had a date table for ~100 years depending on the industry use case.


This is especially common in data warehouses, as the dimension tables need to be populated with all of the attributes that a user will be searching/drilling on (including dates). I don't think I've ever come across a transactional system with a dedicated date table -- you usually have access to a good suite of date-manipulation operations in SQL -- but maybe there's a case for it.


Why not just GROUP BY to_char(timestamp, some_format) / extract(something from timestamp)?


I assume because that doesn't give the 0 results in the query. The RCTE would.


Ah right. generate_series() might be another, possibly simpler, way to generate a time series independent of what data actually exists in the relation, and then join and group that series with the data relation.
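
(Something along these lines, again with a hypothetical "events" table:)

    SELECT d::date AS day, count(e.id) AS events_that_day
    FROM generate_series('2018-01-01'::timestamp, '2018-05-22'::timestamp, interval '1 day') AS d
    LEFT JOIN events e ON date_trunc('day', e.created_at) = d
    GROUP BY d
    ORDER BY d;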


If you're using postgres, take a look at the generate_series() set-returning function for this purpose. It might be a lot cheaper.

EDIT: Looks like someone already suggested this, down-thread.


I don't see a measurable performance difference on my machine. In my experience, set-based SQL is fast even on 100s of GB databases as long as everything is reasonably normalized and indexed.


It is, but by materializing the date series into a table, rather than "lazy-loading" it with the SRF, you're consuming disk IO that other backends could be using. Disk is the narrowest pipe in servicing user queries. That's the sense in which I was thinking it's cheaper.


An RCTE is not materialized into a table on disk. There is no disk IO.

I would also argue that network is the narrowest pipe. The sibling solution, where it is materialized into a table, is an excellent one. The amount of disk IO is negligible - the table is tiny.


Ack. I just realized I replied to you when I meant to reply to the sibling post, not yours. I'm sorry for the confusion. That solution uses a resolution of dates. If that meets your use-case, sure; if you need finer, the IO (and cache) burden grows with that.

Which is the narrowest pipe depends on your plumbing. My application-facing databases have 10 gigabit NICs, and narrower to the storage; I'm absolutely not network-constrained. I also don't really consider the size or shape of the data returned to the application a cost in the same sense I do the disk; application queries routinely denormalize the data into a less compact form. They presumably need the data in that format for a reason (even if that's simply the engineer's cognitive bandwidth).

It's also not a thing I'm generally in control of, so other than offering guidance, there's not a lot I can do about it.


Assuming you're on Postgres, generate_series() is a far easier option to do exactly what you've described.


I'm not sure how much easier it is - the CTE definition and the series definition are extremely similar. Certainly if you have generate_series() or something similar, it's an appropriate solution.


I thought I was crazy for implementing a dependency resolver in CTE but eventually the performance and database-encapsulation were serious winners.


Be careful of performance when using CTEs in Postgres: unlike in other DBMSs, they are optimisation fences with regard to predicate pushdown, so for some queries they will result in extra scans for every level of call needed.

Doesn't affect all queries of course, and where it does, the difference may not be significant compared to what else is going on (i.e. querying a small tree/graph structure to pull out some large/complex data), but it is something to watch out for when working with data of any appreciable size.
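
(A sketch of the shape to watch out for; the "events" table and columns are invented:)

    -- Postgres (as of this writing) materializes the CTE first, scanning all of "events",
    -- even though the outer query only wants one customer
    WITH totals AS (
        SELECT customer_id, sum(amount) AS total
        FROM events
        GROUP BY customer_id
    )
    SELECT total FROM totals WHERE customer_id = 42;

    -- the equivalent derived table lets the planner push the filter down
    SELECT total FROM (
        SELECT customer_id, sum(amount) AS total
        FROM events
        GROUP BY customer_id
    ) t
    WHERE customer_id = 42;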


It's possible that this note at the bottom of the article was not there when you made your comment:

As a note CTEs at this time are an optimization fence in PostgreSQL, though there are hopes of that changing in the future. Common Table Expressions are an incredibly useful tool for reporting. At times the readability of CTEs outweighs the performance impact, but consider the trade-offs when using them


It is also possible that I skimmed over that part!


For the amount of data it processes, it can be slow. I used it to render a category taxonomy of maybe just 100 nodes and it took a perceptible amount of time. The representation was so direct and clear that I kept it and just cached the result server-side, invalidating it whenever any category was changed. I also had to ensure there were no cycles in the data when saving changes.


For some kinds of data, such as taxonomies, there is a datatype called an ltree available in Postgres that can be indexed. Not as general as a CTE but useful for avoiding the optimization barrier in specific cases.
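
(Roughly, with a made-up "categories" table:)

    CREATE EXTENSION IF NOT EXISTS ltree;

    CREATE TABLE categories (
        id   serial PRIMARY KEY,
        path ltree NOT NULL          -- e.g. 'electronics.audio.headphones'
    );
    CREATE INDEX categories_path_idx ON categories USING gist (path);

    -- all descendants of 'electronics.audio', served by the GiST index
    SELECT * FROM categories WHERE path <@ 'electronics.audio';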


It's always nice to learn about new things in areas you already thought you'd covered. I'll definitely look this up.

Another common issue I've encountered is with representing reorderable items. I've usually just amortized a partial renumbering that works well enough but always wondered if there's a better way. Would storing balanced trees and using a recursive CTE work as well and be a bit simpler?


Re-orderable items are best represented using a rational.

There is a great blog post on the different approaches and trade-offs of user-orderable items[0].

[0] https://begriffs.com/posts/2018-03-20-user-defined-order.htm...


You can also mitigate this by judiciously ordering your CTEs when you have more than one: have the earlier ones do more of the winnowing work and serve as inputs to the later ones.

An often unrecognized consequence of CTEs being optimization fences is that they're much easier to farm out to background workers when parallel query execution is a thing.


What pattern ought one look for in the explained plan to identify this issue?

Is this something that can likely be improved, technically speaking?


Yes - most other DBMS don't do this. If you have a filter on a derived table that uses a CTE, Postgres will not be able to push it down to the CTE. If the CTE is expensive to calculate (a lot of rows, or joins, or complicated functions), but you don't need many rows from it in the final results, performance is likely to be very poor compared to rewriting as normal subqueries. You should be able to see this on the plan if you see an expensive CTE subquery with limited row results being used.

This is unfortunate, as it can reduce readability significantly. There's resistance to fixing this as it is apparently considered a breaking change, which it may be for CTEs used for data manipulation (INSERT / UPDATE / DELETE can operate using CTEs), but it shouldn't be for plain SELECTs. However, there are no obvious plans for this to change.

https://blog.2ndquadrant.com/postgresql-ctes-are-optimizatio... has more.


Is this something that would be evident upon running EXPLAIN over a query?


Excess index (or full table) scans on the recursively referenced tables, or "wrong" index choices otherwise on those objects, where your filtering/joining clauses would otherwise allow for more efficient options with the indexes that are available.

I'm not an expert on Postgres (I spend most of my life in MS SQL Server's domain) so I'll not try to be more detailed than that, for fear of accidentally spreading/creating misinformation. Search for "postgres CTE optimisation fence explain" and you'll hopefully find some good examples, as it is a commonly discussed topic once you know the right keywords to search for.


I once built a Reddit clone and used recursive ctes for the comment trees. Imagine each comment has a parent comment foreign key which might be null if at top level. If you use a recursive cte you can simply start with top level comments with null parents and recursively get all of the subtrees and end up with a nice flat list of all your comments in the order you want. The other way to do this is to just fetch all the comments and assemble the tree in memory and then flatten it, which is what I believe many sites do.
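
(A sketch of that query, with an invented "comments" table; carrying an array of ids along gives the flattened, depth-first ordering:)

    WITH RECURSIVE thread AS (
        SELECT id, parent_id, body, ARRAY[id] AS sort_path
        FROM comments
        WHERE parent_id IS NULL AND post_id = 123    -- hypothetical post
      UNION ALL
        SELECT c.id, c.parent_id, c.body, t.sort_path || c.id
        FROM comments c
        JOIN thread t ON c.parent_id = t.id
    )
    SELECT * FROM thread ORDER BY sort_path;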


A third way to do this is described in https://en.wikipedia.org/wiki/Nested_set_model. Its performance for the common case of heavy reads and few writes is better than either of the two that you describe.


Common Table Expressions (CTEs) are sometimes referred to as WITH clauses. This article focuses on WITH RECURSIVE. They are useful ways to deal with tree oriented data.


Why use the word RECURSIVE at all? The engine should recognize the "recursiveness" of the CTE automatically, at the compile stage. Some other DB engines have had this for years.


I think it's for humans. Often humans looking through SQL code are doing so for performance reasons, and a recursive query is fundamentally different from a non-recursive query. It's something you might want to know while skimming, before diving deep.


I see where you are coming from but the idea does not resonate with me. A CTE is a CTE, recursive or not. Sometimes you do not care as long as it generates correct data.

I can see an analogue though: some languages distinguish Sub from Function although they are basically the same idea.

On the other hand, Postgres seems to be the only RDBMS that makes the RECURSIVE keyword mandatory for recursive CTEs.


It annoys me they have made backward incompatible changes to this in Postgres 10, such that recursive CTEs that worked perfectly fine in 9.6 suddenly get rejected in 10. See in particular the function in this StackOverflow answer – https://stackoverflow.com/a/46761197/2147204 – it works fine on 9.6, 10 rejects it with an error.


Nothing to do with CTEs. The queries were badly written to begin with and should have used LATERAL.


Yes, I forgot the actual issue was not CTEs directly.

I agree the query could have been written better (I am still getting my head around how to use LATERAL), but it worked fine in 9.6 and stopped working in 10. From a backward-compatibility viewpoint, code working in one version should still work in the next (even if it isn't the best code), or at least start issuing deprecation warnings one version before making it not work.

Anyway, posting this to HN has triggered someone to go rewrite my code for me (thanks Ants Aasma, whoever you are), so now my Postgres 10 upgrade blocker is solved :)


You're welcome. Wanted to see how hard it is to port a query over.

Postgres generally tries its best not to break users' code. However, sometimes it is necessary for making forward progress. In this case the undocumented behavior of set-returning functions within the select list had some pretty funky, mostly accidental, semantics that were getting in the way of executor improvements. For example, try to figure out how to explain the output of these two queries on 9.6:

    select generate_series(1,2), generate_series(1,4);
    select generate_series(1,3), generate_series(1,4);
That is one example of a silent behavior change between versions, justified on the grounds that applications seeing that behavior are probably broken anyway. Set-returning functions within CASE expressions had more reasonable behavior, so to avoid silent breakage they were made to result in an error.

Deprecation warnings are nice in theory, but in practice they would require an unreasonable amount of effort to properly implement, not seeing any warnings still wouldn't be a guarantee that your application works on new version. And it seems most users ignore deprecation warnings anyway. Besides, it's not like you can avoid making the changes, you just have slightly less schedule flexibility on when to implement them.


I never cease to be amazed at how well Postgres maintains decades-old ACID SQL standards while churning out these innovative advances to the spec.


Postgres is great, but I don't think there is a serious RDBMS that doesn't have recursive CTEs.


Does Oracle have them now? There was a time when it had some janky CONNECT BY syntax of its own.


Oracle >= 10 has it. CTEs are part of the SQL standard.


I'm kinda trying to figure out why this is on HN today. Usually articles, even reposts, contain some sort of novel content or ideas. Recursive CTEs in SQL databases are well known and have been for years.


How do you pass parameters to the recursive function?


I wouldn't call it a function as much as a temporary table defined in terms of itself. It is parametrised the way all CTEs are, by giving arguments in parentheses postfix.
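
(i.e. the parenthesised list names the columns of the temporary result set, as in:)

    WITH RECURSIVE t(n) AS (
        SELECT 1
      UNION ALL
        SELECT n + 1 FROM t WHERE n < 5
    )
    SELECT n FROM t;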



