The criticism of sampling in this blogpost is way overblown. (1) There's virtually no difference between the accuracy of a sample size of 10k and one of 100k, and those are the sample sizes you're usually working with. (2) In particular, when working with custom reports you get from the API, you can actually specify the accuracy you want (high, normal, low) as part of your API request.
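For instance, with the (v3) Core Reporting API the knob is the samplingLevel parameter, which accepts DEFAULT, FASTER or HIGHER_PRECISION. A rough sketch in Python; the view id, dates and token below are placeholders:

    import requests

    # samplingLevel=HIGHER_PRECISION asks GA to trade response speed
    # for a larger sample.
    resp = requests.get(
        "https://www.googleapis.com/analytics/v3/data/ga",
        params={
            "ids": "ga:12345678",
            "start-date": "2015-01-01",
            "end-date": "2015-01-31",
            "metrics": "ga:sessions,ga:pageviews",
            "samplingLevel": "HIGHER_PRECISION",
            "access_token": "YOUR_OAUTH_TOKEN",
        },
    )
    data = resp.json()
    # The response also reports whether sampling happened and at what rate.
    print(data.get("containsSampledData"), data.get("sampleSize"), data.get("sampleSpace"))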
To boot, 9 times out of 10 what you're interested in as an analyst is not the absolute numbers anyway, but ratios and trends. So whether or not e.g. every precious little pageview gets counted is irrelevant, as long as the way it is counted is stable over time.
This is endemic to so many discussions about big data: "if we don't have every individual data point ever, all hell will break loose and we lose the ability to make any sense of the data." Take an intro to basic probability and statistics, will ya.
There are legitimate problems with using Google Analytics for a startup, but they're mostly related to the fact that it doesn't provide good tooling around A/B testing, customer lifecycle management and custom metrics – they're possible, but you're not making it easy on yourself. These things are the bread and butter of SaaS and app analytics (as opposed to ecommerce or media), so it makes sense to invest in something like Mixpanel/Heap/Keen/KISSMetrics. But those are issues with Google Analytics as a product, not with the quality of its data.
Exporting visit trends over 5 landing pages and a month? Sure.
Exporting page views for 100,000 products, each of which got 5-100 views? Then that 5% sample is going to exclude most products. But the latter is exactly what's necessary if you're trying to determine how each product category is really performing.
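A quick back-of-the-envelope simulation of that effect in Python (assuming, for simplicity, a uniform 5% sample over individual pageviews; GA actually samples at the session level, so treat this as illustrative only):

    import random

    random.seed(0)

    # Hypothetical long-tail catalogue: 10,000 products with 5-100 views each.
    views = {p: random.randint(5, 100) for p in range(10_000)}

    # Uniform 5% sample over individual pageviews.
    sampled = {p: sum(random.random() < 0.05 for _ in range(n))
               for p, n in views.items()}

    missed = sum(1 for n in sampled.values() if n == 0)
    print(f"products with zero sampled views: {missed} of {len(views)}")
    # A product with only 5 views escapes the sample ~77% of the time
    # (0.95**5 ~= 0.77), even though the aggregate totals still look plausible.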
Two alternatives I prefer to Google Analytics Premium (once you get to that size): Webtrekk, a small but competent German company whose product costs around 1/10th as much per year, has a fraction of the bugs, and does reliable unsampled daily dumps (moving to hourly, I believe), although the UI is a little less intuitive; and a self-hosted Piwik instance, so you don't need to worry about data exports. The truth is modern relational databases are incredibly powerful and will easily scale even with information like impressions in onsite search. There are multi-TB instances of Postgres out there. I really suggest installing either in parallel to GA or on their own when you set up tracking.
I do agree with you that anybody involved in any kind of job that includes "analytics" in the title, or indeed most people in management, should take an intro stats course. I particularly like Introduction to Statistical Learning because of its brevity, relatively high abstraction level, and lack of maths.
I agree with this in principle, but honestly the sampled numbers can be really misleading. It all depends on the magnitude of your visits. If you ask for a year's worth of data for a company with 10 million visits for the year, you're going to get into trouble. This makes creating tools and automated workflows that use the GA API pretty hard.
Sampling itself makes a ton of sense though, and it's why Google can offer this service at the level of quality they do.
Generally yes, sampling works well and is robust even for surprisingly small samples.
For stratified analyses -- particularly if you want to stratify traffic by income potential -- sampling over all visits may mask the data of interest.
As for your example, a 10k and a 100k sample will differ by a factor of 3.16 in the standard error (that is: the accuracy of estimates of central tendency). In other words, 10x more datapoints buy you only about 3x greater accuracy.
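In symbols, since the standard error of a mean shrinks with the square root of the sample size:

    \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \qquad
    \frac{\mathrm{SE}_{10\mathrm{k}}}{\mathrm{SE}_{100\mathrm{k}}}
      = \sqrt{\frac{100000}{10000}} = \sqrt{10} \approx 3.16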
Google Analytics is really designed for, and works well for, _websites_. If your startup is a website, as opposed to an app or service which happens to be on the web, then it's a good option.
For our web app, we use Mixpanel, along with tracking events into our own database. This allows you to track custom events for the things that matter in your app - think 'someone added something to the cart' or 'someone clicked the reply button', not 'someone visited this page'.
Yes, Google Analytics does let you track custom events, but it's extremely limited compared to Mixpanel, which lets you attach a properties object with as many custom properties as you like to each event, and then do retroactive analysis on them instantly.
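Roughly what that looks like with the official mixpanel Python library (pip install mixpanel); the token and property names here are invented for illustration:

    from mixpanel import Mixpanel

    mp = Mixpanel("YOUR_PROJECT_TOKEN")  # placeholder project token

    # Each event carries an arbitrary properties dict that you can later
    # segment and analyse retroactively in the Mixpanel UI.
    mp.track("user_123", "Added to Cart", {
        "product_id": "sku-42",
        "price": 19.99,
        "cart_size": 3,
        "campaign": "spring-sale",
    })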
> It’s a bit like how Gallup can summarise Indonesians’ smartphone habits by calling 1,500 of them; it works fine if you’re looking for a general pattern, but it might skew the data if you’re looking for data about a tiny niche of smartphone users or if Gallup happened to call up relatively too many Nokia users that day.
This is an incorrect statement and interpretation of how statistics work. Is there a chance that the 1,500 Indonesians they call that day are not representative of the overall population? Of course, but the probability of that is very low. This concept specifically is known as statistical significance[0]. The conclusion is: sampling error can lead to incorrect conclusions, but if you can eliminate biases in your sampling, then a sample can indeed be representative. Personally, the more important takeaway is this: before you start deriving conclusions from your metrics, it's necessary to fully grok basic statistics.
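For scale, the worst-case margin of error at 95% confidence for a simple random sample of n = 1500 is

    \mathrm{MOE} = z\sqrt{\frac{p(1-p)}{n}} \le \frac{1.96 \times 0.5}{\sqrt{1500}} \approx 0.025

i.e. about +/-2.5 percentage points, provided the sample is unbiased.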
Completely agree. Sampling is fine; but you need to understand that the free version of Google Analytics isn't a replacement for a proper BI tool. GA also doesn't even do sampling until your traffic is over a certain threshold, so at low traffic levels it's actually accurate (though those levels are also low enough that you probably shouldn't try to draw too many insights).
This is not to say that GA is bad; in fact I would call you stupid if your web startup went out and bought a tool before you launched. GA is free, it works well, it's super-easy to implement, and people should use it; but once you've reached the point where you can afford something better (usually in addition to GA, not in replacement of) you should look into doing so.
I think Google Analytics is much better than nothing, and until you reach a certain scale (I used the estimate of ~1 million monthly sessions) I think it's sufficient.
What I am mainly arguing is that relying completely on GA for reports and performance measurements is dangerous and frustrating.
I have various low traffic websites, and just because I hated having to go through the burden of creating accounts for all of them in Google Analytics, I wrote a very simple web analytics engine called Microanalytics[1].
It is a Couchapp (which means it only takes a CouchDB database to work, no other server or backend) and it allows emitting custom events with a simple `ma(event, [optional_value])`.
Every event is tied to a session, so later you can analyse and filter events based on session, see exactly who did what on your site, see if the same user came back at some other day, things like this.
So, on my small websites I can clearly see when a person enters the site and all that. Also, 1 visitor makes a difference, and when I tested running Google Analytics alongside Microanalytics, Google Analytics showed a lot more visitors than Microanalytics. I know Microanalytics can't be wrong, because it literally counts me in real time when I enter the site, so I don't know what to think. The only thing it doesn't count is visitors without Javascript, but are there really so many of them? Does Google Analytics count them? I think not.
---
Also good to say: the way Microanalytics does data visualization is through a command line tool that prints to STDOUT, so you can do all sorts of things with Unix pipes. For example, to run an A/B test once, I just called `ma('version', versionName)` in each tested page, `ma('conversion', 'converted')` when appropriate, and later ran the following:
for name in versionA versionB
    echo $name
    # sessions that saw this version
    set v (microanalytics identifier inspect sessions --limit 300 | grep $name | wc -l)
    # of those, sessions that also converted
    set c (microanalytics identifier inspect sessions --limit 300 | grep $name | grep converted | wc -l)
    # visitors, conversions, and the conversion rate
    echo $v $c (echo "$c / $v" | bc -l)
    echo
end
(This example is in the fish shell, but you can do the same in bash, obviously.)
You'd have to dig into the details (referer, user agent, etc.) on the Google Analytics side to see the differences... GA probably tracks every random web scraping (search engine) hit by monitoring the loading of the JavaScript file.
What drives me nuts about Google Analytics is their foot-dragging on mobile.
Despite the mobile version of Google Analytics, trying to do anything meaningful, like analysing app retention, is a huge pain.
All the newer players in mobile analytics, like Flurry or Localytics, have this stuff nailed, but Google Analytics leaves you with outdated, web-oriented reports like "New vs. Returning" and "Loyalty" and generally prefers to push you towards looking at sessions instead of actually understanding what your users are doing.
I haven't explored the current functionality for mobile apps, but we're currently working on multi-channel funnel analysis and it's a complete nightmare given that most purchase journeys happen across more than one device. And that's with GA Premium...
Collecting data from YOUR website - costs time/money.
Google Analytics is FREE and low cost time/money to install.
It's true the data is there and we can get to it... but this takes some foresight that having installed analytics does not require. E.g. a few months after the original Apache logs have rotated, you realize it would be nice to know how many Mobile Safari users are coming to your site vs. Chrome-on-Android users, because you're trying to determine the impact of releasing that next feature that requires WebRTC. You can add the analysis to future traffic and give it a few extra weeks, but wouldn't it be nice to know right now?
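To be fair, if you do still have the raw logs, that particular analysis is only a few lines of scripting. A minimal sketch in Python, assuming combined log format with the user agent as the last quoted field (the browser checks are deliberately crude), though it only covers whatever logs survived rotation:

    import re
    from collections import Counter

    ua_pattern = re.compile(r'"([^"]*)"$')  # last quoted field = user agent

    counts = Counter()
    with open("/var/log/nginx/access.log") as f:
        for line in f:
            m = ua_pattern.search(line.strip())
            if not m:
                continue
            ua = m.group(1)
            if "Android" in ua and "Chrome" in ua:
                counts["chrome_android"] += 1
            elif "iPhone" in ua and "Safari" in ua and "CriOS" not in ua:
                counts["mobile_safari"] += 1  # exclude Chrome on iOS

    print(counts)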
GA is nice because it lets you avoid the investment in time/money and you can still go back and look at historical data points you might not have imagined being useful before...
I think in this way sampling is really not a big deal. If it is, then you're right - absolutely right: collect the data points that require that level of accuracy yourself. But if you need that level of accuracy early on, I don't think you're focusing on the right things...
Actually, copy/pasting javascript would be MORE work.
With the hosting company I use, all I have to do is go to "http://mydomain.com/stats" and enter the username and password I set up. No javascript changes necessary, because everything is in the apache or nginx access logs, which have been around just about forever.
Kinda sad/funny that younger web devs don't know this stuff.
Oh wow, yeah, reading files in /var/log/httpd/ is soooo hard.
I agree that not all people have the capability to work with this (read: front-end developers, Wordpress customizers, etc.). Still, even cPanel lets you download the logs from past months.
cPanel has multiple log analyzers built in. So the user doesn't have to go through that work. If your site is on a shared host and it uses cPanel, just log into cPanel and you'll find them.
> Collecting data from YOUR website - costs time/money. Google Analytics is FREE and low cost time/money to install
The thrust of the article is that it would appear to be just free enough to get you hooked, but making it give accurate, useful results is very, very expensive.
No doubt Piwik is great, but again it's not as easy to set up or maintain as Google Analytics. GA is easy and free, but there's a cost in that you might get sampled data if you're doing a high volume of traffic. OK, but there is also a cost in self-hosting when you're doing a high volume of traffic. No free lunch applies.
I just wrote in another comment[1] about Microanalytics[2]. It is an easy Javascript option that gives you a lot of cool stuff. If you're not a powerful corporation it is definitely worth a try.
1. If you use a CDN or other sort of caching service, a significant fraction of traffic never hits your server, and your logs are incomplete.
2. Logfile parsing does not measure on-page events that do not involve a server call (such as certain UI interactions, and clicking exit links out to other websites).
3. Parsing logfiles gives crap data.
There's a ton of filtering and heuristics built into modern web analytics tools. It's a back-breaking task to re-implement them, and your data is literally less than useless until you're at least part-way through the list (as in, it provides negative value because it drives you to incorrect conclusions).
For example: web crawlers. Google, Bing, Ask, Baidu, Yandex, and a dozen smaller or region-specific search engines all generate automated traffic to your site. Some of their IPs are public, but Google intentionally runs anonymous web crawlers to make sure people aren't sending Google different pages than what they send humans. And those are just the legitimate robots. By some estimates, as much as 50% of modern web traffic is robots. The majority of these are spam bots of one form or another. You know, the ones that troll forums with ad links or malware? Those guys aren't negligible.
Another example: Associating multiple visit sessions from the same person. This is a fuzzy statistical decision based on cookies, IP address, user agent string, device type, and other data. It's complicated, with a lot of edge cases. Modern tools aren't 100% accurate, but they've already seen and accounted for a decade of exceptions.
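A toy illustration of the kind of fallback heuristic involved (entirely hypothetical; real tools weigh many more signals and edge cases):

    import hashlib

    def visitor_key(cookie_id, ip, user_agent):
        # Prefer a first-party cookie id; otherwise fall back to a fuzzy
        # IP + user agent fingerprint, which breaks on shared/variable IPs.
        if cookie_id:
            return "cookie:" + cookie_id
        raw = (ip + "|" + user_agent).encode("utf-8")
        return "fp:" + hashlib.sha1(raw).hexdigest()[:16]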
Another example: Cache-busting. Networks and browsers will cache server calls (including beacon-based tracking), no matter how much you explicitly tell them not to. Unless you specifically write scripting to add random tokens to each of your server calls, you will actually be working with data that has been sampled to an unknown extent. And unlike GA's sampling, this sampling is page-based rather than visitor-based, which means that visit sessions will have holes in them.
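The countermeasure is conceptually simple but easy to forget. A sketch of the idea (endpoint and parameter names hypothetical, though GA's own beacons carry a similar random z parameter):

    import random
    import time
    from urllib.parse import urlencode

    def beacon_url(event, base="https://example.com/track.gif"):
        params = {
            "e": event,
            "t": int(time.time() * 1000),  # millisecond timestamp
            "z": random.getrandbits(32),   # random cache-buster
        }
        return base + "?" + urlencode(params)

    # Unique URL on every call, so intermediaries can't serve a cached copy.
    print(beacon_url("pageview"))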
In my work as a web analyst, I worked with one company that was moving off their own home-rolled system based on Apache log files. They were even putting information from this system into their quarterly earnings report. Once we set them up with a "real" system, we found that their homebrew system had overstated traffic by a factor of ten. Some combination of bot traffic, users on consumer broadband having variable IP addresses, and cross-domain cookie tracking meant that the web aspect of their business had only 10% of the reach they previously thought.
THAT SAID... rolling your own home-brew system is sometimes the correct answer. I'm working with two such clients right now, and they are making the decision that is right for them. It's not the right decision for the majority of companies--it's a large engineering effort, and it needs to have a significant expected payout.
Oh, I agree that httpd log data does not contain all information. But neither does JS-based analytics data (for example crawler data, even though the Google crawler runs some JS; there are also some issues with https and GA).
I just think it's not productive to complain about info that GA is "hiding" when it's most likely in your logs somehow.
You can also add your own js snippet to your pages to help you track some aspects.
I'd definitely suggest your two companies looking to home-brew check out Snowplow as well (https://github.com/snowplow/snowplow). If there's something they would need that isn't available out of the box yet in Snowplow let me know - details in profile!
A quick spin through Segment.com's list of integrations shows a TON of GA-like services. Anyone have any suggestions for an alternative to GA that provides the same type of data without the sampling / bloat / etc. of GA?
Hey Noah, an unfortunate side effect of GA's power is its steeper learning curve. A new driver sitting in a Ferrari might consider all the paddle shifters and gauges to be "bloat," but in the right hands the car is a beast ...
Don't get me wrong, I do wish Google spent more time improving GA, but I think many of the alternatives sacrifice too much just to be beginner-friendly. What we need is a Tesla for web analytics.
Agreed - I've used GA for many years and it's indeed extremely powerful (especially for a free tool). And the tradeoff is fair, I think -- Google gives us a ton of power to analyze site data in exchange for, well, Google being able to analyze site data :) But it does take a bit of a kitchen-sink approach to things.
I guess I should rephrase my question to be less broad -- of the companies that purport to be competitive, do any of them do a particularly good job (even if just at subsets of GA's features)?
Having dealt with lots of customers who take the stats generated by GA as gospel despite multiple problems, I disagree.
Some example problems: referrals unreliable, countries unreliable, sampling distorting figures, no warnings when data displayed is based on very little data, sessions often misinterpreted as clicks by users, inexplicable disparities with other methods of tracking because their methods are pretty opaque and because of sampling, in-page analytics looking deceptively like click-tracking when in most cases it uses page load data. Some of that you can attribute to user error, but it is not good for the market that google dominates tracking like this, and GA is sometimes misleading.
Just as an example of a problem I ran into recently - the free GA doesn't offer referral stats for https websites, but this isn't made clear to end users. As a result they simply trust that referrals have collapsed if a referring site switches to https.
The https referral issue isn't unique to GA, nor is it their fault. The browser doesn't pass a referrer value for HTTPS->HTTP traffic. Best way around this is to use HTTPS yourself, or use custom UTM tags on your link (if you have any say).
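For example (values invented), a campaign-tagged link keeps its attribution even when the referrer header is stripped, because it rides along in the URL itself:

    from urllib.parse import urlencode

    # utm_* values are whatever taxonomy you agree on with the linking site.
    link = "http://example.com/landing?" + urlencode({
        "utm_source": "partner-site",
        "utm_medium": "referral",
        "utm_campaign": "spring-launch",
    })
    print(link)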
GA gives prominence to referrals in a world where more and more sites use https. People are making real business decisions based on this flawed stat, and most people have no idea that this is in fact a useless stat without https if any of your referrers are secure pages. Google says website x was your top referrer, and they just take that as truth.
I think they should just take it out and recommend using campaign links or landing pages, or at the very least make it very clear that this is only a partial and distorted view. Same goes for in-page analytics etc.: the presentation of overlays on page items (implying clicks) is misleading.
I don't think OP is saying GA sucks; I think he's saying that you can't take the results that GA gives you as being to-the-numbers accurate. They're still directionally accurate, and GA is so easy to implement that it gives you a lot of bang for relatively little effort.
Many people actually don't realize that GA does sampling, or the effect said sampling can have on their numbers.
Exactly. Knowing its limitations, and being able to communicate them well to the people you need to convince to succeed in your job, is the right way to go.
Please don't sign comments; they're already signed with your username. If other users want to learn more about you, they can click on it to see your profile.
I agree with your summary but don't agree that it's click bait. I read it as the author's warning that you should RTFM when deploying GA, which I doubt most people do.
In a perfect world, where everyone understands every single data point completely or you don't have to work with anyone else, that's completely true. That kind of thinking is however extremely limiting for 9/10 organisations out there, because teamwork and team buy-in are so important to how decisions are made. Believing that GA can do anything and everything, which is the belief of more managers than you'd like to think, is exactly the underlying issue.
The real issue I find with that is that startups with this kind of thinking tend to put numbers before people or actual results.
And most of the time, you cannot reduce a problem or its solution to number crunching.
Once that breach is opened, it becomes a nightmare explaining to your coworkers that a small drop in sessions probably does not call for a complete restructuring of the website.
I'm currently working in this kind of startup, and doing a good job is actually impossible, as we change everything every 3 months or so, because the numbers in analytics changed.
Yeah that's exactly the kind of frustrations that I've experienced countless times and hope to get better at combatting by exploring how GA really works and how/where it's limited. Slowly getting there..
I loved the article. I think it tracks very much in line with the kind of experiences that I have personally had in working with Google Analytics data -- both via their reporting API and in the dashboard itself. I think GA is a great service though, when the caveats are considered and understood. It's great that you took the opportunity to state and consolidate this information in one place.
I also just want to shout out that I work for a startup called Narrative Science that offers a free product called Quill Engage (https://quillengage.narrativescience.com/) that can help identify some of the key insights from your GA data in a free weekly and monthly automated report.
I agree with most of the points. We use GA for general traffic patterns, definitely not conversion tracking. It is horrible at true conversion tracking.
What kind of conversions are you trying to track? Although it takes some configuration, I've had no problems tracking all kinds of conversions for different sites--ecommerce, SaaS, leadgen, etc.
The worst thing about GA for conversion tracking is that it's not retroactive, so you have to go through a slow configure / wait a few days / see if the data looks right / reconfigure cycle.