The fact of the matter is that I've never been unhappy about overcollecting data. Worst case, step 1 of my pipeline is 10x or 50x slower than it needs to be due to filtering out a bunch of junk. The added latency to my workflow might be a few minutes.
Every time I've undercollected I've been unhappy, and this was hardly a rare occurrence. I need to build the collector, deploy it, and wait for data to flow in. Added latency = 1 week, minimum.
You can always throw useless stale data away. You can never retroactively collect data you needed.