> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data.

The point of Materialize, from my understanding, is that you don't put things into the data warehouse and then, as a separate step, run these enrichment/reporting jobs on it.

Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.
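
To make that concrete, here's a rough sketch in Materialize-flavored SQL (every name here is made up, and the exact source-creation syntax depends on the Materialize version):

    -- A raw table of events. In practice this would more likely be a
    -- CREATE SOURCE over Kafka or files, but a table keeps the sketch simple.
    CREATE TABLE orders (
        id          bigint,
        customer_id bigint,
        amount      numeric,
        created_at  timestamp
    );

    -- Register the persistent, incrementally-maintained "streaming job"
    -- as an ordinary materialized view over that table.
    CREATE MATERIALIZED VIEW revenue_by_customer AS
    SELECT customer_id, sum(amount) AS total_revenue
    FROM orders
    GROUP BY customer_id;

    -- New rows landing in `orders` flow through the dataflow behind the
    -- view and update it incrementally; reads are then cheap lookups.
    SELECT total_revenue FROM revenue_by_customer WHERE customer_id = 42;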

I believe you end up with a consistent MVCC snapshot between the source data and its dependent inc-matviews, where you can never see new source data and old derived data in the same query; sort of as if the inc-matview were being updated by a synchronous ON INSERT trigger on the original raw-data ingest transaction. (Or—closer analogy—sort of like the raw table has a computed-expression index visible as a standalone relation.)
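
If that consistency guarantee holds, a single query that reads both the raw table and a view derived from it should never disagree with itself, even while data is streaming in. A toy check, assuming a hypothetical view orders_per_customer that counts orders per customer:

    -- Both relations are read at the same logical timestamp, so these
    -- two numbers should always match (no "new raw, old derived" skew).
    SELECT
        (SELECT count(*) FROM orders)                       AS raw_order_count,
        (SELECT sum(order_count) FROM orders_per_customer)  AS derived_order_count;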

Also, the job itself is written as a regular OLAP query over a fixed-snapshot dataset, just as a data scientist would expect; but the query planner compiles it into something like a continuous CTE/window query that spits out running totals/averages/etc. as it gets fed incremental row-batches, and only at the granularity that consumers demand from it, whether by directly querying the inc-matview or by sourcing it in their own streaming jobs. Which, of course, you do by just writing a regular OLAP query over inc-matview X in your definition of inc-matview Y.
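
Composing them looks like this, reusing the made-up names from above (again just a sketch of the SQL you write, not of the dataflow the planner actually produces):

    -- View Y is just another OLAP query whose FROM clause happens to be
    -- view X; the planner chains the two dataflows so updates cascade
    -- from orders -> revenue_by_customer -> top_customers.
    CREATE MATERIALIZED VIEW top_customers AS
    SELECT customer_id, total_revenue
    FROM revenue_by_customer
    ORDER BY total_revenue DESC
    LIMIT 10;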

> I would love to hear the use cases that your customers have that made Materialize a good fit!

We haven't used Materialize yet, but it's a very promising fit for our use-case: high-level financial-instrument analytics on blockchain data. We need to take realtime-ish batches of raw transaction data and extract, in essence, arbitrarily-formatted logs from arbitrary program-runs embedded in the transaction data (using heavy OLAP queries); recognize patterns in those logs that correspond to abstract financial events (more heavy OLAP queries); enrich those pattern-events with things like fiscal unit conversions; and then emit the enriched events as, among other things, realtime notifications, while also ending up with a data warehouse populated with both the high- and low-level events, amenable to arbitrary further OLAP querying by data scientists.
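
Very roughly, that pipeline would become a stack of views, something like the following (Postgres-flavored SQL; every table, column, and JSON shape here is invented purely for illustration):

    -- Layer 1: pull individual program logs out of raw transactions.
    CREATE MATERIALIZED VIEW decoded_logs AS
    SELECT t.block_time, t.tx_hash, l.log
    FROM transactions t
    CROSS JOIN LATERAL jsonb_array_elements(t.logs) AS l(log);

    -- Layer 2: recognize the log patterns that correspond to abstract
    -- financial events.
    CREATE MATERIALIZED VIEW transfer_events AS
    SELECT block_time, tx_hash,
           log ->> 'from'              AS sender,
           log ->> 'to'                AS receiver,
           log ->> 'asset'             AS asset,
           (log ->> 'amount')::numeric AS amount
    FROM decoded_logs
    WHERE log ->> 'type' = 'transfer';

    -- Layer 3: enrich the events (e.g. unit conversions); this view feeds
    -- both realtime notifications and ad-hoc OLAP queries downstream.
    CREATE MATERIALIZED VIEW transfer_events_converted AS
    SELECT e.*, e.amount * r.rate AS amount_fiat
    FROM transfer_events e
    JOIN fx_rates r ON r.asset = e.asset;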



> Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.

This is correct. There is an example in the docs of dumping JSON records into a source and then using materialized views to normalize and query them: https://materialize.io/docs/demos/microservice/
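
Paraphrasing the general shape of that pattern (the real source definition and names are in the linked demo, and the exact CREATE SOURCE syntax varies by version):

    -- Assume a source `requests` whose rows carry raw JSON in a text
    -- column called `data` (hypothetical names; see the linked demo for
    -- the real CREATE SOURCE statement).
    CREATE MATERIALIZED VIEW requests_parsed AS
    SELECT
        (data::jsonb ->> 'user_id')::bigint AS user_id,
        (data::jsonb ->> 'cpu_ms')::bigint  AS cpu_ms
    FROM requests;

    -- Downstream views then aggregate the normalized records, and stay
    -- up to date as new JSON records arrive on the source.
    CREATE MATERIALIZED VIEW cpu_by_user AS
    SELECT user_id, sum(cpu_ms) AS total_cpu_ms
    FROM requests_parsed
    GROUP BY user_id;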



