
I'm curious how QuestDB handles dimensions. OLAP support with a reasonably large number of dimensions, and cardinality in the range of at least thousands, is a must for a modern-day time-series database. Otherwise, what we get is only an incremental improvement to Graphite -- a darling among startups, I understand, but a non-scalable, extremely hard to use time-series database nonetheless.

A common flaw I see in many time-series DBs is that they store one time series per combination of dimensions. As a result, any aggregation can end up scanning potentially millions of time series. If a time-series DB claims that it is backed by a key-value store, say, Cassandra, then the DB will have this issue. For instance, Uber's M3 used to be backed by Cassandra, and therefore would give a mysterious warning that an aggregation function exceeded the quota of 10,000 time series, even though from the user's point of view the function dealt with a single time series with a number of dimensions.
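To illustrate why per-combination storage blows up, here's a toy sketch (all dimension names and cardinalities are made up):

    # Each unique combination of dimension values becomes its own
    # stored series, so one logical metric fans out multiplicatively.
    dimensions = {
        "region": 20,
        "host": 5_000,
        "endpoint": 200,
    }

    series_count = 1
    for name, cardinality in dimensions.items():
        series_count *= cardinality

    # 20 * 5,000 * 200 = 20,000,000 stored series for one metric.
    # An aggregation grouped only by "region" still has to fetch and
    # merge all 20M underlying series in a key-value layout.
    print(f"stored series per metric: {series_count:,}")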



FYI M3 is now backed by M3DB, a distributed, quorum read/write, replicated columnar time-series store specialized for real-time metrics. You can associate multiple values/time series with a single set of dimensions if you use Protobufs to write data; for more, see the storage engine documentation[0]. The current recommendation is not to limit your queries but to limit the global data queried per second[1] by a single DB node, using a cap on the number of datapoints (inferred from blocks of datapoints per series). M3DB also uses an inverted index built on mmap'd FST segments[2], similar to Apache Lucene and Elasticsearch, to make multi-dimensional searches on very large data sets fast (hundreds of trillions of datapoints, petabytes of data). This is a bit different from traditional columnar databases, which focus on column stores and are rarely accompanied by a full-text-search inverted index. (A toy sketch of this inverted-index lookup follows the links below.)

[0]: https://docs.m3db.io/m3db/architecture/engine/

[1]: https://docs.m3db.io/operational_guide/resource_limits/

[2]: https://fosdem.org/2020/schedule/event/m3db/, https://fosdem.org/2020/schedule/event/m3db/attachments/audi... (PDF)
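A toy sketch of the inverted-index lookup described above, with plain dicts and sets standing in for M3DB's mmap'd FST segments (all names here are invented for illustration):

    # Map each "label=value" term to the set of series IDs containing
    # it, then intersect posting sets for multi-dimensional queries.
    from collections import defaultdict

    postings: dict[str, set[int]] = defaultdict(set)

    def index_series(series_id: int, labels: dict[str, str]) -> None:
        for name, value in labels.items():
            postings[f"{name}={value}"].add(series_id)

    def query(**labels: str) -> set[int]:
        """Return series IDs matching all given label filters."""
        terms = [f"{k}={v}" for k, v in labels.items()]
        if not terms:
            return set()
        result = set(postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= postings.get(term, set())
        return result

    index_series(1, {"service": "api", "region": "us-east"})
    index_series(2, {"service": "api", "region": "eu-west"})
    index_series(3, {"service": "db", "region": "us-east"})
    print(query(service="api", region="us-east"))  # {1}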


Recommended reading on FST for the curious: https://blog.burntsushi.net/transducers/


Thank you for mentioning that. Andrew's post is really fantastic, covering fundamentals, the data structure, real-world impact, and examples all in one place.


I love FSTs and similar structures; they're just such a cool idea.

Anyone know of other interesting blog posts/articles about FSTs?


Thanks, @roskilli! Nice documentation.


We store "dimensions" as table columns with no artificial limits on column count. If you are able to send all dimensions in the same message, they will be stored on one row of data. If dimensions are sent as separate messages, the current implementation will store them on different rows, which makes the columns sparse. We can change that if need be and "update" the same row as dimensions arrive, as long as they have the same timestamp value.
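A minimal sketch of the two ingestion shapes described above, using InfluxDB Line Protocol over a socket (the table, column names, and port are assumptions for illustration, not guaranteed specifics):

    import socket

    # All dimensions in one message -> stored as columns on one row.
    one_row = b"sensors,region=us-east,host=web01 temperature=22.5,humidity=0.41 1600000000000000000\n"

    # The same dimensions split across messages -> separate (sparse)
    # rows under the current implementation described above.
    sparse_rows = (
        b"sensors,region=us-east,host=web01 temperature=22.5 1600000000000000000\n"
        b"sensors,region=us-east,host=web01 humidity=0.41 1600000000000000000\n"
    )

    with socket.create_connection(("localhost", 9009)) as sock:
        sock.sendall(one_row)
        sock.sendall(sparse_rows)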

There is also an option to store sets of dimensions in separate tables and combine them with ASOF/SPLICE joins.
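For clarity, an ASOF join pairs each left-hand row with the most recent right-hand row at or before its timestamp. A minimal, database-agnostic sketch of that logic (made-up data):

    import bisect

    def asof_join(left: list[tuple[int, str]], right: list[tuple[int, str]]):
        """Assumes both inputs are sorted by timestamp."""
        right_ts = [ts for ts, _ in right]
        out = []
        for ts, lval in left:
            # Index of the latest right row with timestamp <= ts.
            i = bisect.bisect_right(right_ts, ts) - 1
            rval = right[i][1] if i >= 0 else None
            out.append((ts, lval, rval))
        return out

    trades = [(100, "buy"), (205, "sell")]
    quotes = [(90, "1.10"), (200, "1.12"), (210, "1.13")]
    print(asof_join(trades, quotes))
    # [(100, 'buy', '1.10'), (205, 'sell', '1.12')]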


Thanks for the explanation.


Can you handle multiple time dimensions efficiently? We have three of them; can one get away without having to physically store "slices" on one of them?


If you can send all three in the same message (via the InfluxDB Line Protocol, for example), we will store them as three columns in one table. Does this help?
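A hypothetical example of such a message, with one designated timestamp and two extra time columns carried as integer fields (all names invented):

    event_time_ns = 1_600_000_000_000_000_000      # designated timestamp
    ingest_time_ns = 1_600_000_005_000_000_000     # extra time column 1
    effective_time_ns = 1_599_999_000_000_000_000  # extra time column 2

    # The trailing value is the designated timestamp; the other two
    # ride along as integer fields and land in their own columns.
    line = (
        f"events,source=feed1 "
        f"ingest_ts={ingest_time_ns}i,effective_ts={effective_time_ns}i "
        f"{event_time_ns}\n"
    )
    print(line)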



