> This query takes about 50ms! We experimented with returning the document text with results and found that if we returned the full text of documents, it added about 350ms to the query time which seems to be network overhead rather than a slower query. If we only returned 200 characters from the texts then it only added around 100ms.
The way you have written the query, it has to extract the title from all of the JSON documents before applying the LIMIT.
Perhaps do the work to get the identifiers, order by rank, and apply the limit... all in a subquery. So that only the outer query actually touches the JSON document and extracts the titles.
Thanks for the feedback! I tried out your suggestion by running the following:
SELECT id, meta->>'title' as title, meta FROM(
SELECT id, meta FROM (
SELECT id, meta, tsv
FROM data_rows, plainto_tsquery('search query') AS q
WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('search query')) DESC LIMIT 5
) AS t2;
Please let me know if this isn't what you meant. The above did improve the speed but only a little. I think for our use case this approach wouldn't be entirely necessary but maybe if we had a larger data set then it'd be worth implementing.
I hate writing SQL blind, but more like this... note that you fetch only the identifiers with the inner query, and use them in the outer query to retrieve the documents.
SELECT d.id
,d.meta->>'title' as title
,d.meta
FROM (
-- Only select the identifier column
SELECT id
FROM data_rows
,plainto_tsquery('search query') AS q
WHERE tsv @@ q
ORDER BY ts_rank_cd(tsv, q) DESC
LIMIT 5
) AS ids
-- Join back onto the data_rows, this fetches the documents
INNER JOIN data_rows d ON d.id = ids.id;
You'll want to go through the explain plan, but treat identifier retrieval as a step by itself. Things like ts_headline (not shown), or meta->>'title' (shown) should only be performed for the few rows you actually are retrieving and you should make the data you pass around be as small as possible at all points.
Note that you don't need to call plainto_tsquery twice, you can use the one you've already declared.
And I notice you are returning the JSON document whole, have you checked whether it's faster to extract the title here or later in whatever your programming language is?
The giveaway clue in your post btw is this: "We experimented with returning the document text with results and found that if we returned the full text of documents, it added about 350ms to the query time which seems to be network overhead rather than a slower query.".
As soon as you notice a linear slowdown, when you're only returning 5 rows whatever happens... then internally you must be doing something in the query that relates to all rows.
PostgreSQL fulltext scales nicely, I use it on a platform driving over 300 web sites with 10s of millions of comments. I originally used it thinking it will buy me time before I have to add a dedicated search engine, and it helps now as I store permissions knowledge in the database too... but truth is, it's just not been an issue yet. I'm not needing to consider anything else as it's performing great. The slowest part of my query isn't fulltext search, it's my permissions stuff (because I've gone down the rabbit hole of users, groups, roles, etc and this is calculated for all content), but still the 95th percentile for searches is typically under 10ms.
Just wanted to say, I implemented this change and it did really speed things up! Thanks a bunch :) This is the query I went with:
SELECT d.id, d.meta FROM (
SELECT id FROM data_rows, to_tsquery('query') AS q
WHERE tsv @@ q ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 5
) AS ids INNER JOIN data_rows d ON d.id = ids.id;
The way you have written the query, it has to extract the title from all of the JSON documents before applying the LIMIT.
Perhaps do the work to get the identifiers, order by rank, and apply the limit... all in a subquery. So that only the outer query actually touches the JSON document and extracts the titles.
Try it and report back on the speed improvement.