> My point here is more: if you plan on using the data, it becomes more and more expensive as your access patterns change.
I don't think this is true either. Google does not need to keep much of its data in hot storage to use it effectively: their ML products can be periodically trained / updated, their search can be iteratively updated with each crawl, etc. Sure, it would be expensive to keep all user data from all sources in hot storage all the time - but it's not needed. The idea that you...would happen upon some new question you hadn't thought of before and need the answer immediately is just false. Instead, you make regular updates to a model and periodically run your corpus through that model.
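To make the access pattern concrete, here's a minimal sketch of that "periodic pass" idea: rather than keeping the corpus hot for ad-hoc queries, a scheduled job scans cold storage in chunks and folds each chunk into the current model. All names here are hypothetical illustrations, not any real pipeline; a production system would use a scheduler and a distributed scan, not in-memory lists.

```python
def stream_cold_corpus(records, chunk_size=2):
    """Yield the corpus in chunks, as a sequential cold-storage scan would."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

def update_model(model, chunk):
    """Stand-in for an incremental model update over one chunk of documents."""
    model["docs_seen"] += len(chunk)
    return model

def periodic_batch_run(records):
    """One scheduled pass: scan cold storage once, updating the model as you go."""
    model = {"docs_seen": 0}
    for chunk in stream_cold_corpus(records):
        model = update_model(model, chunk)
    return model

result = periodic_batch_run(["a", "b", "c", "d", "e"])
print(result["docs_seen"])  # every document touched in a single batch pass
```

The point of the pattern is that cost scales with how often you schedule the pass, not with how many new questions you think of between passes - the next run answers them all at once.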