He's on to something here, but I think the asset/liability duality isn't a matter of your point of view. Code really is a liability, even from the high-level view in the boardroom. The asset is functionality. If you can get more functionality with less code, you improve your balance sheet.
By analogy, data is also a liability. The asset is insight.
This strikes me as some sort of Orwellian contortion. It really is a matter of your point of view. Insofar as code is correlated with functionality (and it is at least to some degree), code can be viewed as an asset. In addition, code can be sold for cash, and this makes it an asset from the strictly financial point of view. It may have maintenance costs associated with it -- and these certainly need to be taken into account -- but that is not the same as being a liability.
Unless you've heard of places paying per line of code, you seem to have missed the point of code being a liability. It's what it does that's valuable, not the volume of it.
That's no more true of code than of land or anything else; that's just the fundamental idea of what the value of a thing is: what you can use it to do or get.
So it's odd to say that value doesn't lie in code but in the functionality that code provides, without also saying that of everything else. And to say it at all -- of code or more generally -- is just to use language in an atypical way (and, if it's said only of code, an inconsistent way as well).
But the point is that you pay for the code in order to gain the functionality. Code is an expense. The application is the asset. If you want to add a feature to the application then you have to pay for it in developer time. The larger and more complicated the supporting code, the more time you will have to pay for.
I pay for X because X provides valuable functionality is generally true of productive assets. And it's not unusual for productive assets to depreciate or have maintenance expenses that must be born to maintain their functionality -- usually both.
So, while those are all valid observations about code, they don't make it a liability, they make it a fairly bog-standard productive asset.
So given the choice between having feature X written in 1,000 lines or 1,000,000 lines of code, you don't care which is which? I find that hard to believe.
If code truly were an asset, it wouldn't really matter whether it was 1,000 lines or 1,000,000 lines, as they're functionally equivalent, right? Or perhaps the million-line program would be worth more, wouldn't it? Code is an asset.
But the amount of code directly influences the maintenance cost, doesn't it? And maintenance cost is a liability, no? It's something that you're obligated to pay if you want to keep the asset productive. And an obligation to pay is a debt, isn't it?
Yes this is all stretching a bit, but it's not entirely wrong either.
You have asserted your statement to be true. OK. Why?
Here's my point. I can buy two factories to make cars. One costs $100mm and $100k/mo in maintenance to run (fixing machines, etc). The other costs $100mm and $10mm/mo to run. They make identical output, the exact same cars at the exact same rate with the exact same inputs and costs. Which factory should I buy? And why?
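For concreteness, a back-of-the-envelope sketch of the factory comparison above, using the comment's own (hypothetical) numbers, a ten-year horizon, and no discounting -- the maintenance gap dwarfs the identical purchase price:

```python
# Two factories: same $100mm purchase price, same output, different upkeep.
purchase = 100_000_000            # $100mm up front, identical for both
monthly_maintenance_a = 100_000   # factory A: $100k/mo
monthly_maintenance_b = 10_000_000  # factory B: $10mm/mo
months = 10 * 12                  # assumed 10-year horizon

total_a = purchase + monthly_maintenance_a * months
total_b = purchase + monthly_maintenance_b * months

print(f"Factory A, 10-year cost: ${total_a:,}")  # $112,000,000
print(f"Factory B, 10-year cost: ${total_b:,}")  # $1,300,000,000
```

Identical "assets", an order-of-magnitude difference in what you actually end up paying.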
> Here's my point. I can buy two factories to make cars. One costs $100mm and $100k/mo in maintenance to run (fixing machines, etc). The other costs $100mm and $10mm/mo to run. They make identical output, the exact same cars at the exact same rate with the exact same inputs and costs. Which factory should I buy? And why?
Actually, that analogy reveals the silliness of the assertion that code is not an asset, because it's equivalent to asserting that factories -- the canonical example of a capital asset -- are not assets.
That a code base with greater maintenance requirements and equivalent functionality may be analogous to a higher-maintenance factory with similar functionality to a lower-maintenance one (which isn't really controversial) doesn't support the idea that code is not an asset, just that maintenance cost is an important consideration in the net value of a productive asset, which, again, is common across all classes of assets.
That LOC might be a metric associated with maintenance costs of code doesn't stop code from being an asset, any more than the fact that part count might be a metric associated with maintenance costs of industrial machines stops such machines from being assets.
Are we arguing over the true definition of the terms asset and liability or over the point the OP was trying to make when using these terms as a metaphor?
Hard to tell. The whole thing gets pretty wacky pretty fast. I think that, metaphysically, features are assets and code is a liability. Similarly, production capacity is an asset and the machines that comprise that production capacity are liabilities.
So what does that mean? You end up with a (probably) tangible thing that has assets and liabilities all intertwined in it. Mechanically these tend to be inseparable. But when it comes to code, it's often possible to drastically reduce complexity (liabilities) while maintaining the features (production capacity) and so trying to value it exactly the same as a factory doesn't make a ton of sense.
For example in 2000 you might have spent $2mm making a big CRUD app that runs your business. And you might still value it at $2mm even though today a programmer or two, a year, and a modern web framework might remake it for $300k. And it might go from 100kLOC of "in house" code down to 20kLOC because the framework does most of the heavy lifting.
You don't often see that kind of thing happen in the physical world, which is why people bundle everything up. But when talking about code it's easier to see how to split the value production apart from the value consumption, to where a person might talk about code as a liability and the things that are automated as the asset.
No, the salary of the person who wrote the code is the expense. The code is intellectual property that can be considered an asset, although without a pricing mechanism from a marketplace, it can be difficult to say exactly how much it is worth.
I did not mean to imply that the volume of code was directly tied to its value. I understand that value stems from functionality and not from quantity of code. But functionality does stem from code, as jsprogrammer said. If your code does X, and X is valuable, then the code is valuable -- I don't see what's wrong with this line of reasoning.
An airliner certainly is an asset (at least in the hands of a well run airline), but the lighter the frame the better: a lighter frame needs less fuel to fly, hence more money remains in the bank at the end. But without a frame there's no plane, so you need some of it -- just as little as feasible.
It's the same for code: a running program can certainly be a huge asset, but the less code you have the lower the maintenance cost. The concept is just harder to grasp than paying a kerosene bill each time a plane takes off...
I totally agree with your sentiment but I do want to add that adding code can be useful and make it better even if you're not adding functionality. For example, while it may be "cool" to turn a straightforward loop into a complicated one-liner (which would be "removing code while retaining functionality") that wouldn't necessarily be a good thing. The longer, more straightforward code is likely to have a lower maintenance cost.
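For instance (a contrived Python sketch, not from the original discussion): the two functions below are functionally equivalent, but the longer one is arguably easier to step through, debug, and extend later -- which is the point that fewer lines isn't automatically lower maintenance cost:

```python
# The terse version: fewer lines, denser to read and modify.
def squares_of_evens_short(nums):
    return [n * n for n in nums if n % 2 == 0]

# The longer, more explicit version: more LOC, but each step is a
# natural place to add logging, edge-case handling, or a breakpoint.
def squares_of_evens_long(nums):
    result = []
    for n in nums:
        if n % 2 == 0:
            result.append(n * n)
    return result

print(squares_of_evens_short([1, 2, 3, 4]))  # [4, 16]
print(squares_of_evens_long([1, 2, 3, 4]))   # [4, 16]
```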
I think the issue is that you are conflating lines of code with complexity. I certainly think it is better to write short programs, but it is wrong to think that less LOC == better code.
The most valuable kind of software IMO is the kind you don't think about because it never fails, and at that point your LOC measurements are irrelevant.
Depends if it adds anything else - readability, maintainability, flexibility, portability etc.
The "code as liability" argument seems, unless I'm mistaken, to be based on the idea that less code for the same level of functionality means less cruft, and therefore better support for the NFRs that I mentioned above.
That's often true, but not always.
If you're including those NFRs in what you mean by same functionality, then the number of lines of code is utterly irrelevant.
From an accounting perspective I am inclined to disagree. Firstly, the way you present functionality as asset and code as liability, you do not make it easy for me to read how they are components of a balance sheet. More specifically, how do you see your balance sheet improving while still being balanced?
Furthermore, both assets and liabilities (including capital) are stock measures. I would consider insight more of a flow than a stock. More specifically, I think it is the process and tools of achieving insights that is the asset. Whether data is part of this asset base is up for grabs, but accountants have not identified any reasonable way to measure it.
"...the light that falls on to your eye, sensory information, is meaningless, because it could mean literally anything. And what's true for sensory information is true for information generally. There is no inherent meaning in information. It's what we do with that information that matters." Beau Lotto
I have constructed the argument before that if you consider functionality as an asset and code as a liability, you can consider a refactor as retained earnings (or stockholder's/owner's equity, whichever you prefer). Along those lines, refactoring code becomes much more palatable.
I know it's a flawed analogy because refactoring costs money, so it isn't really RE. It did help me answer a question along a similar vein to yours ("...how do you see your balance sheet improving while still being balanced?"); where I was assuming a reduction in liabilities satisfied "...balance sheet improving...".
As far as the balance sheet, less code means lower costs and more projects from development. Because it costs less to maintain less code than it does large, convoluted codebases. Are you better off with a codebase that has similar value (functionality) but much lower maintenance costs? The value of a codebase is the functionality it provides, not the number of lines of code in use.
Instinctively, management will try to throw more resources at a problem instead of paying off technical debt before it becomes unwieldy.
I don't know if I understand properly, but aren't the dimensions at least implicitly defined if we have an n-dimensionally measured point? Any additional dimension would either be irrelevant or would require additional data to populate it.
This is a common distinction. The issue is that my information may be your data and therefore any such definition does not refer to data or information but to someone's relationship with it. I allude to that with the quote I included above.
Data: Any collection of information, bullshit and noise.
Information: Data that has been judged to be true or false through comparison with observed reality at a given point in time.
Bullshit: (1) Data (often a firehose) produced by someone who is indifferent to the truth or falsity of what is being said (2) Data that appears to contain more information than it actually does (3) Noise randomly tagged with truth-values to give it apparent legibility
Noise: The component of data that is neither information nor bullshit and at risk of being prematurely discarded ...
Falsehood: Information that is known to be inconsistent with the observed state of the world.
Truth: Information that is known to be consistent with observed reality at a given time and capable of being unpredictably turned into falsehood by a change in reality in the future.
Illegibility: Information or bullshit that looks like noise to those outside a system, due to the presence of a large amount of metis. (Derived from James Scott).
Metis: A collection of formulas that work to maintain the identity of a system.
Intelligence: The ability to separate bullshit from information.
Belief: Sincere acceptance of the truth or falsity of a piece of information.
Art: The process of creating information starting with bullshit.
I like the definitions, but I can't be the only one who notices that, according to these definitions, data is "information and ..." while information is "data that ...". It looks like a circular definition, which definitely cannot help with the distinction.
> Code really is a liability, even from the high-level view in the boardroom. The asset is functionality.
The programmers who understand and maintain the code are the real assets; everything else is a liability. Without those programmers, those programs will quickly rot because change is always happening and sticking a random programmer in there will over time degrade the program.
Smart companies know this, it's what talent acquisitions are all about. You want the goose that lays the golden egg, not the egg. The egg has some short term value, but it depreciates quickly, the goose is the real asset.
In an ideal world as specified by the 1900 US Congress, even Disney's intellectual property would depreciate over time, and the assets would be the animators who create new intellectual property.
...or their workforce would merely have a lot more animators, musicians, producers, etc and would necessarily produce more content more regularly than simply sitting on a mountain of IP and merely re-marketing it every few years with lots of lawsuits and lobbying mixed in to "protect" said mountain of IP.
They would also be able to re-use and remix the content of others (e.g. competitors) more regularly and the public at large would be able to do the same. Resulting in more content for everyone to enjoy.
Please see: context, we are obviously not talking about accounting nor using the words asset and liability in the accounting sense, nor was the article.
I think the better analogy is inventory, which is an asset. The value of data depreciates over time. Having too much data that is not producing insight or value is inefficient asset allocation.
In other words, lean principles should be applied to data. Reduce data collection that doesn't add value. Calling data a liability is not quite the right analogy for me.
The problem with this is that while the value of data may depreciate over time, the cost of storing it also falls over time, thanks to Moore's law and Kryder's law.
The depreciation isn't the important part. The article suggests considering data as a liability in the accounting sense, but the author should have stopped at calling it a liability in the non-accounting sense.
Assets have value, but there is risk associated with it. Assets like inventory can lose value because of depreciation, or because demand for the finished good has dried up, there's a risk of theft, there's a cost of storage, etc.
Data should be viewed as an asset like inventory and lean principles can be used to reduce your risk and produce customer value efficiently.
The reason to collect everything (or rather, more things than you think you'll need to answer the questions you think you'll need to answer) is that…you could be wrong about what data are required to answer the questions you've identified, and that you could be wrong about which questions you will care about.
And historical data can be extremely useful, e.g. when looking at seasonal or cyclical trends (woe betide the grocer who doesn't stock up on turkeys in mid November, which means woe betide the grocer who doesn't order turkeys in the summer).
Yes, he's absolutely right that data management and privacy impose a cost: the cost of storing a GB of data is more than S3's 10¢. It takes a human being to make a judgement call about whether this data is likely to be worth its overall cost and risk. That's why managers and other decision-makers get paid what they get paid. 'Data is a liability' is a nice soundbite, but it doesn't capture the full reality; one can't manage to soundbites.
This is a strange critique. His recommendation isn't a soundbite. It is: ask questions, figure out what you need to know to answer those questions, and then collect that data.
That (asking questions and making informed decisions) sounds like exactly what those "managers and decision makers" ought to be doing in the first place. To collect everything because you can't decide on a strategy is exactly the opposite.
(Re: turkeys and fairly complex retail sales cycles. A year or three of aggregated purchase history is likely to yield actionable insight. It's unlikely that a slight upward trend in turkey purchases since 2006 is very actionable.)
Old data is possibly uninteresting. That's debatable. But all uncollected data is freshly uncollected data.
I strongly believe in collecting as much granular data as possible. It's totally possible to throw it away after it has outlived its likely usefulness.
But if I want to calculate a new machine learning signal today, I can't wait six months to accumulate enough data. I want that data to exist already. And the only solution is to overcollect -- and to collect at a granular level so I can aggregate and transform later.
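A minimal sketch of "collect granular, aggregate and transform later", with hypothetical event records (the field names are made up for illustration):

```python
from collections import defaultdict

# Hypothetical raw, granular event log: one record per user action.
events = [
    {"user": "u1", "action": "view",  "item": "a"},
    {"user": "u1", "action": "click", "item": "a"},
    {"user": "u2", "action": "view",  "item": "a"},
    {"user": "u2", "action": "view",  "item": "b"},
]

# Because the granular records were kept, any aggregate can be derived
# after the fact -- e.g. per-item view counts...
views = defaultdict(int)
for e in events:
    if e["action"] == "view":
        views[e["item"]] += 1
print(dict(views))  # {'a': 2, 'b': 1}

# ...or a per-item click-through rate, a question nobody had asked
# at collection time.
clicks_a = sum(1 for e in events if e["action"] == "click" and e["item"] == "a")
print(clicks_a / views["a"])  # 0.5
```

Had only a pre-chosen aggregate been stored, the second question would require new collection code and months of waiting.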
The liability and security concerns are real. The fact that most companies are stupid about investing in unnecessary big data infrastructure is real too. But a recommendation to spool to cheap offline or nearline storage immediately is interesting. A recommendation to throw data away seems like folly to me.
If your operation is the sort that capitalizes on machine learning on subtle signals, you capture the sorts of data that you know might be beneficial 6 months down the road. Although I suspect that you overstate the sort of "surprise questions" that might come up (and also the granularity necessary to answer them), that is a correct response to a business model that demands those insights at that granularity.
That doesn't mean that "collect everything at max granularity" is good advice, because as you said:
> The fact that most companies are stupid about investing in unnecessary big data infrastructure is real
The fact of the matter is that I've never been unhappy about overcollecting data. Worst case, step 1 of my pipeline is 10x or 50x slower than it needs to be due to filtering out a bunch of junk. The added latency to my workflow might be a few minutes.
Every time I've undercollected I've been unhappy, and this was hardly a rare occurrence. I need to build the collector, deploy it, and wait for data to flow in. Added latency = 1 week, minimum.
You can always throw useless stale data away. You can never retroactively collect data you needed.
Here's a simple counter-example. You're an e-commerce company, and in year 1, you can choose which js events/hits to track. For the sake of simplicity (and perhaps because prompted to do so by the Google Analytics tutorial) you only track product views (i.e. loading a product page) and conversions.
In year 5, you now process 5,000 or 50,000 orders a day, and you're wondering what the click through rate of your products is when they come up in a search. That's your "question", which will help you figure out which 100 of the 100,000 products you stock your customer will be interested in (because it's 10x as much data as conversion rate).
Guess what: those who installed Piwik and tracked the "impression" event/hit can now immediately play with it. You, on the other hand, have to start tracking it now and just missed out on 5 years of data to explore, for example, which brands your customers like.
It wouldn't have cost you much to track everything - maybe $1-10k/year for an AWS server to host the Piwik database (it's a bit costlier if you're with Google - $150,000/year for Google Analytics Premium + $15,000/year for BigQuery to be able to query the hit IDs, and only starts tracking on the day you activate it).
There's a difference between being indecisive about what data or questions you care about now and being unsure about which data/questions you will care about in the future. If your data needs might change in the future, then there is an argument to be made in favor of saving data that has no current apparent value, and this must be weighed along with everything else when deciding what data to keep. Sometimes data can suggest new questions, and sometimes it is worth collecting data purely in the hope that it will generate new questions.
As an example from my research area, the human genome was not sequenced to answer any one specific biological question; it was sequenced because without it, we would not even be able to ask the kind of questions we wanted to, much less answer them.
Of course, that's a research context. In a business context, especially in a well-established industry, the types of data that you need are likely to be well-understood and exploratory analysis is probably a lot less important.
Genomics is an area where data is thrown away all the time. The images that come from Illumina sequencing machines usually get processed once, then discarded.
We throw away the images because we're quite sure at this point that we're extracting all the useful information (the DNA sequences) that we can from them. This was not true in the early days of Illumina sequencing, when it was not uncommon to save the images and run an alternative base caller on them to try to get improved sequences when the standard base caller failed.
Any hypothesis testing requires historical data. It's easy to fit trends to anything and harder to predict and project. Asking questions first and then collecting data sounds nice, but it's movie-style sleuthing -- unless you want to wait the time it takes to collect enough data to test your model.
Collecting everything would never be the optimal strategy, and is probably the least meaningful one economically.
As for the unexpected extra value in historical data: the chance that the return exceeds the associated total cost doesn't necessarily increase with the size of the dataset. If we look at the market for historical things, only those with scant availability and high quality hold real value.
I agree with the author that a better strategy would be to start with some unique and useful questions, and store data for a purpose (insight).
Asset or liability? It is not a black or white thing; it is a scale that depends on the information's value and fidelity. In information theory, the information content of an observation (measured in bits) is higher the less probable the observation is.
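The relationship being gestured at here is Shannon self-information: an outcome with probability p carries -log2(p) bits, so rarer observations carry more information. A minimal illustration:

```python
import math

def self_information_bits(p):
    """Shannon self-information of an outcome with probability p (0 < p <= 1)."""
    return -math.log2(p)

print(self_information_bits(0.5))    # 1.0 -- a fair coin flip carries 1 bit
print(self_information_bits(0.125))  # 3.0 -- a 1-in-8 event carries 3 bits
```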
Same is true of my kids schools. Every year, we have to fill out (on paper) the same extensive forms to collect parent names and addresses, phone, email, whether you are homeless, what your racial makeup is, on and on.
No option for "everything is the same as last year" and they completely flip out if you don't return the forms.
This article is highly misleading. Sure, certain kinds of data (like credit card data or medical data) can be a liability if not properly managed. If you are working in one of those industries you already know that (although arguably many financial and medical companies underestimate the risks).
In contrast, data about what users have clicked, comments on an online forum, or who has friended who on your social media site is not a liability. In many cases this data is already public. Even in cases where it's not, it is usually incredibly boring for any hacker to steal, like what web pages John Doe has clicked on in the last 10 minutes. Hackers are going to go after stuff like social security numbers or credit card numbers, not data about the average length of time people spent looking at Pepsi Inc. web page layout A versus web page layout B.
The argument that you should only store "the data that you need" is a circular one. How do you know what data you need? Well, you run an analysis. How do you run an analysis? Using the data you have. Short-sighted policies like throwing away historical data, as the article recommends, effectively blind you to long-term trends.
The bigger problem is the nature of this sort of data. All of this sort of data needs to be invalidated all the time and provided as some sort of hash rather than actual data that is usable.
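One common way to act on the "some sort of hash" idea is a keyed hash (HMAC) of the identifier -- a sketch, with the caveat that a plain unkeyed hash of low-entropy data like emails is trivially reversible by brute force, which is why a secret key is assumed here:

```python
import hashlib
import hmac
import os

# Hypothetical setup: in practice this would be a managed secret,
# rotated on a schedule rather than generated per process.
SECRET_KEY = os.urandom(32)

def pseudonymize(identifier: str) -> str:
    """Store this token instead of the raw identifier."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("jane.doe@example.com")
# Same input -> same token, so joins and deduplication still work,
# while the raw email is never written to the datastore.
assert token == pseudonymize("jane.doe@example.com")
# Rotating SECRET_KEY invalidates every token at once, which matches
# the "needs to be invalidated all the time" point above.
```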
So JPMorgan took a bit of a reputation hit when hackers compromised the data of 83m of the company's clients. [1] The financial repercussions were minimal, if there even were any to begin with. You can bet they still collect and store all the data they can. They don't really care, and customers don't really care all that much either.
On the other hand, we have medical data, which is essential for pharmaceutical and academic research but would be very, very harmful in the wrong hands. A non-compliant company will get smacked with heavy fines under HIPAA for not safeguarding data in a strict, standardized manner.
Until government regulation makes data breaches substantially costly for Company X (Target, Adobe, LastPass, Department of Personnel Management, etc.), Company X will continue to gather as much data as possible.
It's an asset with unbounded upside (who knows what great economic engine data might fuel in 5-10 years) and no financial downside because it carries no legal risk, and very minimal storage costs.
>>Think this way for a while, and you notice a key factor: old data usually isn’t very interesting. You’ll be much more interested in what your users are doing right now than what they were doing a year ago. Sure, spotting trends in historical data might be cool, but in all likelihood it isn’t actionable. Today’s data is.
Uhhhh, really? There have been a lot of times in my past where, as an analyst, I wished I had a bunch of historical data to measure the seasonality of trends. Am I the only one who baulks at this comment? Is this a startup-oriented perspective?
> In the US, your customers are an asset. In the EU, your customers are people with rights
because that’s what this is about. And I, personally, have to fully support the EU view here, which is that the privacy of a person is more important than the profits of a company.
Another way of looking at the article is that it's more important to have the right data than just having lots of data. So it's very important to think about the questions that one would want to answer and design data collection to capture the answers, rather than just going ahead and storing everything in a brute force way.
When you figure out what it is, you may find you don't have the right data because you weren't asking the right questions, but you sure have the information security liability that goes with it whether or not there is any upside.
There are lots of ways to structure a database that will make whatever information you collected useless or worse than useless for your intended purpose once you figure out what you need to know.
I am inclined to agree with the article. If you aren't actively using it currently, why keep it?
One of the arguments during the Boston bombings was that "if Boston had, like NYC, a fully integrated CCTV system automatically filming everything and allowing us to track people across the city, we could have found the attackers more easily".
One has to seriously think about this. We, as a society, are signing away some of our most important rights for a statistically insignificant boost in security.
I live in Germany, but even here we still feel these changes, which were started with 9/11.
It's now the 14th anniversary of 9/11, and I have to say, long-term, Bin Laden has won. The western world has given up so many rights for the war on terrorism.
Giving up privacy just to collect more data, so we feel safer (while the data is just thrown away) is just another step where we forget that people are actually, well, people. With rights.
The argument from the article is that old data isn't useful, current data is. So once you figure out what you want to look at, you can start looking at current information. Even if you have older records, you're generally not going to end up trawling your old datastores anyway.
A $125 billion market for big data solutions is NOT a sign that data is a liability. What is the value of that data?
I'm also skeptical of the compliance/privacy argument. If you're collecting any data at all, it's a potential liability. The volume of the data doesn't change the risk level much.
1. It assumes the externalised costs of data exposure are fully realised. What is Ashley Madison's liability for suicides of members attributed to its data leak? Or for the multi-generational impacts of fractured families (those immediately affected, plus children, and possibly parents of spouses who split on account of the breach)? A frequently levelled complaint about "remediation" for data disclosure is that it does little to address the full costs to those victimised by it.
AM is all the more interesting in that, as Annalee Newitz's investigative reporting reveals: not only were personal data collected, but (at least for straight males), it wasn't for the stated purpose: "Ashley Madison created more than 70,000 female bots to send male users millions of fake messages, hoping to create the illusion of a vast playland of available women." In fact, it was a feeder system to a set of bots, "affiliates", and, it seems, escort services.
2. It assumes the market is rational. Enron's market capitalisation was $60 billion on December 31, 2000, prior to its collapse. The share price fell from a peak above $90 to less than one dollar by November 2001; its bankruptcy filing was dated December 2, 2001.
The larger problem: what is seen cannot be unseen.
An exceptionally peculiar aspect of digital data is that, while it may remain in the boxes and cages provided for it, it's got a notable tendency to find itself liberated. Often without warning, and not detected for days, weeks, months, or longer, afterward (as in this case). In the real world we've got friction, especially associated with data processing and transfer. In digital form, far less so. Sometimes friction is good.
Finally: that's the market for tools to analyse the data, not for the data itself. Big Data is a current business fad. Companies are told they must capitalise on Big Data, so they buy miracle solutions to do that. Some can, and in cases spectacularly. Many cannot -- the added value-per-customer is small, or, in the case of breach, negative.
I'm quite happy to see others starting to recognize this. It's a problem, as someone who's dealt with "big data" since the early 1990s, that I've been quite well aware of for several decades.
What you almost always want to do is to roll data up to non-individualised aggregates as soon as practically possible. The rest is just dry powder waiting for a spark.
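A minimal sketch of that roll-up, with hypothetical purchase records (field names invented for illustration): individual rows are collapsed into (week, product) counts, after which the per-customer rows -- the dry powder -- can be deleted.

```python
from collections import Counter

# Hypothetical raw records: one row per individual purchase.
raw = [
    {"customer_id": 101, "week": "2015-W46", "product": "turkey"},
    {"customer_id": 102, "week": "2015-W46", "product": "turkey"},
    {"customer_id": 101, "week": "2015-W46", "product": "bread"},
]

# Roll up to non-individualised (week, product) counts;
# the customer_id never makes it into the aggregate.
aggregate = Counter((r["week"], r["product"]) for r in raw)
print(dict(aggregate))
# {('2015-W46', 'turkey'): 2, ('2015-W46', 'bread'): 1}

# The aggregate still answers seasonal-trend questions ("order turkeys
# in the summer") even after the raw rows are purged.
raw.clear()
```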
Totally agree. Was going to add that an expanding schema is a liability but that seems self evident with the comments about code expansion.
If you've ever built a very busy application you know the truth of this. At first the massive access to data seems like a gift, and you'll log it all "just in case". You might even brag about the volumes of data you have and speculate about their value to investors and internally. But eventually you realize the risk, the cost, and the cost of managing the risk.
Private customer data often carries a liability, but not for the reasons the OP states. The liability is that companies have an ongoing obligation to their customers to protect their private data. A lot of data, however, does not carry this liability: if it got published to the world, it wouldn't matter.
This article portrays a common misunderstanding of the accounting terms liability and asset. Just because something has a cost to maintain, it does not make it a liability.
Code is an asset. Data is an asset. Businesses do not value assets at their cost, as the article suggests, but by their future economic value.
Asset:
"Things that are resources owned by a company and which have future economic value that can be measured and can be expressed in dollars. Examples include cash, investments, accounts receivable, inventory, supplies, land, buildings, equipment, and vehicles."
Liability:
"Obligations of a company or organization. Amounts owed to lenders and suppliers. Liabilities often have the word "payable" in the account title. Liabilities also include amounts received in advance for a future sale or for a future service to be performed."
A liability can mean something that is a hindrance or puts an individual or group at a disadvantage, or something someone is responsible for, or something that increases the chance of something occurring (i.e. it is a cause).
This is one reason we decided at Cotap to purge messages after 14 days by default, and only keep them longer if requested (https://cotap.com/blog/customizable-data-retention-for-busin...). Contrary to what one might expect, most users embraced the change.
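A default retention window like the one described is simple to sketch. The following is a hypothetical illustration of the idea, not Cotap's actual implementation; the message shape and names are made up:

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=14)

def purge_expired(messages, now, retention=DEFAULT_RETENTION):
    """Keep only messages younger than the retention window.
    Each message is a dict with a 'sent_at' datetime (hypothetical shape)."""
    return [m for m in messages if now - m["sent_at"] < retention]

now = datetime(2015, 8, 15)
messages = [
    {"body": "old", "sent_at": now - timedelta(days=20)},
    {"body": "recent", "sent_at": now - timedelta(days=1)},
]
kept = purge_expired(messages, now)  # only the recent message survives
```

The key design choice is that deletion is the default and longer retention is the opt-in, so the liability shrinks on its own unless someone asks otherwise.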
Years ago I worked at a large advertising network that was concerned about fraudulent impressions -- e.g., ads placed below the fold, hidden behind other elements, or otherwise generating impressions that weren't real.
I suggested we could build a small piece of supplemental ad code that would load alongside one of our ads on a page and "look around": see where the ads were placed and so on.
The idea was rejected because it would create too much data. I argued that we could trigger the fraud-detection code once per n impressions, with n being 100 or 1000, and still identify fraudulent sites with statistical certainty (our problem would be false negatives). But they couldn't wrap their heads around sampling just enough data to answer a question rather than collecting ALL of it, so the idea went nowhere.
Of course it's also highly likely that they didn't actually want to detect fraud.
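The 1-in-n sampling described above can be sketched as follows; the rate and function name are illustrative, not from the original system:

```python
import random

SAMPLE_RATE = 1000  # inspect roughly one impression in n

def should_inspect(n=SAMPLE_RATE):
    # Fire the supplemental "look around" code for ~1/n of impressions.
    # Even at 1/1000, a site serving millions of impressions yields
    # thousands of samples -- plenty to flag fraud with confidence,
    # at a thousandth of the data volume.
    return random.randrange(n) == 0
```

Each impression decides independently, so no per-user or per-site state needs to be stored to make the sampling work.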
"Then you collect the data you need (and just the data you need) to answer those questions."
If you are providing something to anonymize activity, then sure, if it is legal. And you want to store as little data as possible that would be hazardous if it were made public. But for everything else, it's probably not a good idea to have this attitude.
There are many questions you can't anticipate before you need them answered, and by then it is good, or even necessary, to have the data historically: auditing changes made to the system or data, logging some requests and responses, tracking user behavior for analysis by marketing, and so on. Over time, depending on the site or service, you may want all of those things and more.
Maybe you don't have the resources to analyze the data today, but you still need to collect it for when you do. A history of data gives you a null hypothesis to work with: when we see a drop in traffic in August, is it because we're losing relevance, or because August tends to be a slow month?
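The seasonal-baseline argument can be made concrete with a simple check against the same month in prior years. This is a sketch; the two-standard-deviation threshold is an arbitrary choice for illustration:

```python
import statistics

def is_anomalous(current, history, threshold=2.0):
    """Flag `current` (e.g. this August's traffic) if it deviates more
    than `threshold` standard deviations from the figures recorded for
    the same month in past years."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > threshold * stdev
```

If Augusts have always been slow, a slow August won't be flagged; without the history, the same dip would look like lost relevance.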
Data is absolutely an asset. All decisions are driven by data (personal, biased, anecdotal, etc.), so why not make decisions based on more data points? It's cute to think you can "ask questions first, then collect", but this wastes time in 2015. Imagine being asked at Netflix to "calculate the % watch-through of 18-21 year olds for kids' movies", then a week later "for action movies from the 1980s" -- as opposed to already having the viewing info for all your users before those questions arise.
BTW, users DO want you to use their data to improve your service. Otherwise google, facebook, twitter, netflix, etc. would not be as successful as they are. Liability only comes into play when OTHER parties access (legally or illegally) your data.
Well, this kind of data collection is exactly why the US tech industry is considered to be reckless with the privacy and safety of data.
And court cases have shown that the "we record everything" clause in the ToS is legally null and void -- only data the user directly expects to be recorded is legal to record.
If you are a flashlight app, and in your ToS is "we’ll send all your contacts to a third-party company", then this is void, and if you do so, the user can sue you. As has happened with Facebook in the EU several times.
This reckless "we’ll record everything" is why the EU is planning their new "you can't record anything unless directly allowed to, and you can not sell it or give it away to any third-party entities, not even governments" law. (Well, the last part was thanks to the NSA).
I am perfectly fine with that, in fact I expect it when using services like google and facebook. I expect that facebook is crunching my data and feeding me ads, while the advertisers never see who I am or what I'm about.
Well, it's simple: If people are outraged when you tell them what facebook actually stores, then it's not okay.
And the outrage exists.
I do not think anyone should store anything about me unless I explicitly want to give it to them.
One example is Google's horrible traffic tracking system. In my city every lane of every intersection has a separate inductive loop to measure traffic congestion, speed and amount. Also every few hundred meters between intersections. Perfect data, and you only collect as much as necessary.
In comparison, Google's system: Track the location of everyone, and then check if people are moving slowly on a road in a car.
One of these stores an Orwellian amount of information, the other stores just as much as necessary.
Anyway, as a German, I hate this type of data collection. Yes, I openly admit, I hate the business model of the US startup industry. And I hope this fad stops very soon, as currently we have a truly dystopian nightmare of data collection.
Just think about what a less-democratic government could do if they'd get elected in the US in a few years. They could seize Google's data, force them to comply (like the NSA already did) and then use the data against the people.
You have location data for almost everyone, and not just for now, but for every day for the past 5 years, with 5-minute accuracy. You have banking data and access, emails, browser and search history.
This is too much power resting in a single place. The potential for abuse is insane.
A company being successful doing X does not mean that X is what users want.
Some users probably want Google etc. to use all data available to improve their services, but some don't. The vast majority of those companies' users don't think about how the companies use their data at all. The services get better, sure, but I don't think the full price has yet been paid.
> users DO want you to use their data to improve your service.
The tech industry really needs to stop projecting what they want to be true. This is a common delusion, unfortunately, and it is exactly this attitude that will eventually bring legislation to restrict data collection.
> Otherwise [...] would not be as successful as they are.
They are successful because they offer a useful way to take advantage of the massive utility available in a General Purpose Computer. Data collection is largely hidden from people, either intentionally or simply because it requires extensive engineering knowledge to understand all of the various tracking techniques and how they are used.
From personal experience, as I've explained what is actually happening to people, a very common response is, "This cannot possibly be legal. Why aren't they in jail?" This addiction to data only makes you a stalker or peeping tom.
That "engagement" of people using various services isn't people participating happily -- it's people who do not see any other option. When most people don't know about alternatives (if they even exist), most end up resigned to using services they know are harmful[1].
If you want to see if people actually want to make this kind of exchange, get proper informed consent first. Put it in a contract, and make sure they understand the data they are giving up and what it will be used for. Anything less than this, and you're taking advantage of ignorance.
> Liability only comes into play when OTHER parties access (legally or illegally) your data.
Liability isn't in play right now because it has been kept out of legislation partly from successful lobbying by the telco and software industries, but mostly because our political system is currently a dysfunctional mess. Eventually, this kind of irresponsible, antisocial behavior will piss enough people off, and legislation to fix this liability problem will follow.
It's definitely an asset, but it's potentially a dangerous one. It might be like owning enriched uranium fuel pellets. Extremely valuable asset that can power a society, but dangerous if it falls into the wrong hands or is allowed to contaminate the environment.
> Public data is an asset. Private data is a liability.
I think, more precisely:
Useful data is an asset. Legally sensitive (e.g., because of privacy laws) data is a liability. Those categories overlap, and data in the overlap may be a net asset or a net liability depending on the specific utility it provides balanced against the costs (both certain compliance costs and risks) associated with the specific legal protections.
This isn't really unusual -- the same thing is true of pretty much everything a business (or individual) might own. Real estate, for instance, is an asset to the extent that it is useful, but owning (or even possessing as a tenant) real estate comes with certain maintenance costs and liability-related risks.
First, hat-tip to Marko for running a business with integrity. I really like the parallel between data and finance as it relates to privacy.
As it turns out, banks have a very similar history of making promises & violating them. There are many parallels between banks & debt and data & technology.
I wrote a post titled "Silicon Valley Data is the New Wall Street Debt" that you folks may like:
I think this is a pretty good general rule, but as with everything in technology, it's not true in every case -- for instance, auditing and medical records. Basically the only exceptions are going to be things not actionable in the market.
I read an interesting book once that distinguished clearly between data, information, wisdom and perhaps a couple of other levels of this hierarchy. Advice is usually not data. Data is raw, unprocessed information. Being awash in it is not necessarily valuable, but wisdom is a different matter altogether.