I'm working on a partition-oriented declarative data build system. The inspiration comes from working with systems like Airflow and AWS Step Functions, where data orchestration is described explicitly and the dependency relationships between input and produced data partitions are complex. Put simply, writing orchestration code for this case sucks - the goal of the project is to enable whole data platforms to be made up of jobs that declare their input and output partition deps, so that they can be automatically fulfilled, enabling Kubernetes-like continuous reconciliation of desired partitions.
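To make "declaring partition deps" concrete, here's a rough sketch of the shape I have in mind - every name here (`Partition`, `daily_revenue_deps`, the table names) is invented for illustration, not the actual API:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical shapes purely for illustration -- not the real API.

@dataclass(frozen=True)
class Partition:
    table: str
    day: date

def daily_revenue_deps(out: Partition) -> list[Partition]:
    """Map an output partition to the input partitions it needs."""
    return [
        Partition("orders", out.day),
        Partition("fx_rates", out.day - timedelta(days=1)),
    ]

# Given a desired output partition, the system can walk this mapping
# recursively and build whatever doesn't exist yet.
print(daily_revenue_deps(Partition("daily_revenue", date(2024, 6, 1))))
```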
This means, instead of the answer to "how do we produce this output data" being "trigger and pray everything upstream is still working", we can answer with "the system was asked to produce this output data partition and its dependencies were automatically built for it". My hope is that the interface with the system then becomes continuously telling it which partitions we want to exist and letting it figure out the rest, instead of the byzantine DAGs that get built in Airflow et al.
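A toy version of that reconcile loop, just to show the shape of the interface (again, all names invented, continuing the sketch above):

```python
from datetime import date, timedelta

# Toy reconciler -- declare desired partitions, let the loop fill in the rest.
wants = {("daily_revenue", date(2024, 6, d)) for d in range(1, 31)}
live: set[tuple[str, date]] = set()  # partitions that currently exist

def deps(p: tuple[str, date]) -> list[tuple[str, date]]:
    table, day = p
    if table == "daily_revenue":
        return [("orders", day), ("fx_rates", day - timedelta(days=1))]
    return []  # source tables have no upstream partitions

def reconcile() -> None:
    frontier = list(wants)
    while frontier:
        p = frontier.pop()
        missing = [d for d in deps(p) if d not in live]
        if missing:
            frontier.append(p)        # revisit once deps are live
            frontier.extend(missing)  # deps get popped (built) first
        elif p not in live:
            live.add(p)               # stand-in for actually running the job

reconcile()
assert ("fx_rates", date(2024, 5, 31)) in live  # pulled in transitively
```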
This comes out of a strong feeling that even the more recent orchestrators like Prefect, Dagster, etc. are still solving the wrong problem and not internalizing the right complexity.
Very much agree that this is the direction data orchestration platforms should go - the basic DAG creation can be straightforward, depending on how you do the authoring (parsing SQL is always the wrong answer, but is tempting) - but backfills, code updates, etc. are when it starts to get spicy.
I think this is where it gets interesting. With partition dependency propagation, backfills are just “hey, this range of partitions should exist”. Or, more likely, your “wants” for those partitions are still active, and you can just taint the existing partitions: tainting invalidates them, so the wants trigger builds again, and existing consumers stop seeing the tainted partitions as live. I think things actually get a lot simpler when you stop trying to reason about those data relationships manually!
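Roughly the taint path I'm imagining, as a sketch with invented names, continuing the toy model above:

```python
from datetime import date

# Sketch of tainting -- invented names, not a real API.
live = {("daily_revenue", date(2024, 6, d)) for d in range(1, 31)}
tainted: set[tuple[str, date]] = set()

def taint(table: str, days: list[date]) -> None:
    """Invalidate existing partitions without touching the standing wants."""
    for d in days:
        tainted.add((table, d))

def is_live(p: tuple[str, date]) -> bool:
    # Consumers only see partitions that exist and aren't tainted;
    # the reconciler treats tainted partitions as missing and rebuilds them.
    return p in live and p not in tainted

# "Backfill" after a code change: taint the range, the wants stay declared,
# and the next reconcile pass rebuilds those partitions with the new code.
taint("daily_revenue", [date(2024, 6, d) for d in range(1, 31)])
assert not is_live(("daily_revenue", date(2024, 6, 15)))
```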
This is true, but you can get combinatorial complexity explosions, especially with the data modeling patterns for efficiency common at some companies - e.g. a mix of latest dimensions and historical snapshots, without always having clear delineations about when you're using which. A common example is something like a recursive incremental table that needs to be rebuilt from the first partition seed. Some SQL operations can also be very opaque (syntactically, or in terms of special DB features) as to which partitions are being referenced, especially again when aggregates get involved.
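For example, the recursive incremental case looks something like this in partition-dep terms (made-up notation, table names are hypothetical):

```python
from datetime import date, timedelta

SEED = date(2020, 1, 1)

def user_balance_deps(day: date) -> list[tuple[str, date]]:
    # Each day's partition folds that day's events into yesterday's partition,
    # so the table depends on itself one partition back.
    deps = [("ledger_events", day)]
    if day > SEED:
        deps.append(("user_balance", day - timedelta(days=1)))
    return deps

# Taint (or change the logic behind) any historical partition and every
# partition after it is invalidated -- a one-day "backfill" is really a
# rebuild of the whole chain from that day forward.
```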
It's absolutely solvable if you're building clean; retrofitting onto existing dataflow is when things get messy, and then you're managing user/customer expectations of a stricter system. People like to be able to do wild things!