This is a strange critique. His recommendation isn't a soundbite. It is: ask que...

bhntr3 · on Sept 11, 2015

Old data is possibly uninteresting. That's debatable. But all uncollected data is freshly uncollected data.

I strongly believe in collecting as much granular data as possible. It's totally possible to throw it away after it has outlived its likely usefulness.

But if I want to calculate a new machine learning signal today, I can't wait six months to accumulate enough data. I want that data to exist. And the only solution is to overcollect. And to collect at a granular level so I can aggregate and transform later.
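To make "collect granular, aggregate later" concrete, here is a minimal sketch. The event fields (`user`, `action`, `ts`) and the two helper functions are hypothetical, invented purely for illustration; the point is only that several different aggregates can be derived after the fact from the same raw records, whereas a pre-aggregated daily counter could not be turned back into per-user numbers.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw event log: one record per user action, kept at full granularity.
raw_events = [
    {"user": "u1", "action": "view", "ts": "2015-09-01T10:00:00"},
    {"user": "u1", "action": "click", "ts": "2015-09-01T10:00:05"},
    {"user": "u2", "action": "view", "ts": "2015-09-01T23:30:00"},
    {"user": "u1", "action": "view", "ts": "2015-09-02T08:15:00"},
]


def daily_counts(events, action):
    """Aggregate raw events into per-day counts for one action type."""
    counts = Counter()
    for event in events:
        if event["action"] == action:
            day = datetime.fromisoformat(event["ts"]).date().isoformat()
            counts[day] += 1
    return dict(counts)


def actions_per_user(events):
    """A different aggregate, computed later from the same raw records."""
    return dict(Counter(event["user"] for event in events))


print(daily_counts(raw_events, "view"))
print(actions_per_user(raw_events))
```

Had only the daily view counts been stored, the per-user breakdown would be unrecoverable; with raw events, it is one more function.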

The liability and security concerns are real. The fact that most companies are stupid about investing in unnecessary big data infrastructure is real too. But a recommendation to spool to cheap offline or nearline storage immediately is interesting. A recommendation to throw data away seems like folly to me.

oconnore · on Sept 11, 2015

If your operation is the sort that capitalizes on machine learning on subtle signals, you capture the sorts of data that you know might be beneficial 6 months down the road. Although I suspect that you overstate the sort of "surprise questions" that might come up (and also the granularity necessary to answer them), that is a correct response to a business model that demands those insights at that granularity.

That doesn't mean that "collect everything at max granularity" is good advice, because as you said:

> The fact that most companies are stupid about investing in unnecessary big data infrastructure is real

yummyfajitas · on Sept 11, 2015

The fact of the matter is that I've never been unhappy about overcollecting data. Worst case, step 1 of my pipeline is 10x or 50x slower than it needs to be due to filtering out a bunch of junk. The added latency to my workflow might be a few minutes.

Every time I've undercollected I've been unhappy, and this was hardly a rare occurrence. I need to build the collector, deploy it, and wait for data to flow in. Added latency = 1 week, minimum.

You can always throw useless stale data away. You can never retroactively collect data you needed.

crdb · on Sept 11, 2015

Here's a simple counter-example. You're an e-commerce company, and in year 1, you can choose which js events/hits to track. For the sake of simplicity (and perhaps because prompted to do so by the Google Analytics tutorial) you only track product views (i.e. loading a product page) and conversions.

In year 5, you now process 5,000 or 50,000 orders a day, and you're wondering what the click through rate of your products is when they come up in a search. That's your "question", which will help you figure out which 100 of the 100,000 products you stock your customer will be interested in (because it's 10x as much data as conversion rate).

Guess what, those who installed Piwik and tracked the "impression" event/hit can now immediately play with it. You, on the other hand, have to start tracking it now and have just missed out on 5 years of data to explore, for example, which brands your customers like.
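As a rough sketch of what "playing with it" could look like, assuming the hits were logged as simple records with a `product` and an `event` field (a made-up schema, not Piwik's actual one), the click-through rate per product is just clicks divided by search impressions:

```python
from collections import defaultdict

# Hypothetical hit log; field names and values are illustrative only.
hits = [
    {"product": "A", "event": "impression"},
    {"product": "A", "event": "impression"},
    {"product": "A", "event": "click"},
    {"product": "B", "event": "impression"},
    {"product": "B", "event": "impression"},
    {"product": "B", "event": "impression"},
    {"product": "B", "event": "impression"},
    {"product": "B", "event": "click"},
    {"product": "C", "event": "click"},  # click with no recorded impression
]


def click_through_rates(hits):
    """Return clicks / impressions per product.

    Products with no recorded impressions are skipped, since their
    rate is undefined.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for hit in hits:
        if hit["event"] == "impression":
            impressions[hit["product"]] += 1
        elif hit["event"] == "click":
            clicks[hit["product"]] += 1
    return {p: clicks[p] / n for p, n in impressions.items() if n > 0}


print(click_through_rates(hits))
```

Product C shows the problem in miniature: without the impression events, there is no denominator, and no amount of later analysis can reconstruct one.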

It wouldn't have cost you much to track everything - maybe $1-10k/year for an AWS server to host the Piwik database (it's a bit costlier if you're with Google - $150,000/year for Google Analytics Premium + $15,000/year for BigQuery to be able to query the hit IDs, and only starts tracking on the day you activate it).

rcthompson · on Sept 10, 2015

There's a difference between being indecisive about what data or questions you care about now and being unsure about which data/questions you will care about in the future. If your data needs might change in the future, then there is an argument to be made in favor of saving data that has no current apparent value, and this must be weighed along with everything else when deciding what data to keep. Sometimes data can suggest new questions, and sometimes it is worth collecting data purely in the hope that it will generate new questions.

As an example from my research area, the human genome was not sequenced to answer any one specific biological question; it was sequenced because without it, we would not even be able to ask the kind of questions we wanted to, much less answer them.

Of course, that's a research context. In a business context, especially in a well-established industry, the types of data that you need are likely to be well-understood and exploratory analysis is probably a lot less important.

collyw · on Sept 11, 2015

Genomics is an area where data is thrown away all the time. The images that come from Illumina sequencing machines usually get processed once, then discarded.

rcthompson · on Sept 11, 2015

We throw away the images because we're quite sure at this point that we're extracting all the useful information (the DNA sequences) that we can from them. This was not true in the early days of Illumina sequencing, when it was not uncommon to save the images and run an alternative base caller on them to try to get improved sequences when the standard base caller failed.

eanzenberg · on Sept 10, 2015

Any hypothesis testing requires historical testing. It's easy to fit trends to anything and harder to predict and project. Asking questions first then collecting data sounds nice but it's movie-style sleuthing, unless you want to wait the time it takes to collect enough data to test your model.