Intriguing, but I think I lost you a little in the details and really dying to u...

runT1ME · on Dec 6, 2021

Let's say your immutable DAG has four different 'transformations' (think either a map or a fold on the data). Beam or Spark will take a chunk of your data, partition it somehow (so you get parallelism), and do a transform on the data.

Now if you're doing a bunch of computations off of tens or hundreds machines you don't want to fail the whole thing because one hypervisor crashes or there's a JVM segfault or whatnot. So each step is usually saved to disk via checkpoint and then moved along to the next transformation in batches, so if the subsequent transfomation is where it dies, you have a safe spot to restart that particular computation from.

In addition, your (chunk of) data might need to be shipped to an additional machine to be transformed so you're not so much 'updating' data as you are creating new data in a space efficient way of the changes and shuffling those along the various steps.

chrisjc · on Dec 6, 2021

Ok, makes sense now. Thanks for taking the time to explain.