The criticism of sampling in this blogpost is way overblown. (1) There's virtually no difference between the accuracy of a sample size of 10k and one of 100k, and those are the sample sizes you're usually working with. (2) In particular, when working with custom reports you get from the API, you can actually specify the accuracy you want (high, normal, low) as part of your API request.
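For instance, with the (v3) Core Reporting API the knob is the samplingLevel parameter, which accepts DEFAULT, FASTER or HIGHER_PRECISION. A rough sketch in Python; the view id, dates and token below are placeholders:

    import requests

    # samplingLevel=HIGHER_PRECISION asks GA to trade response speed
    # for a larger sample.
    resp = requests.get(
        "https://www.googleapis.com/analytics/v3/data/ga",
        params={
            "ids": "ga:12345678",
            "start-date": "2015-01-01",
            "end-date": "2015-01-31",
            "metrics": "ga:sessions,ga:pageviews",
            "samplingLevel": "HIGHER_PRECISION",
            "access_token": "YOUR_OAUTH_TOKEN",
        },
    )
    data = resp.json()
    # The response also reports whether sampling happened and at what rate.
    print(data.get("containsSampledData"), data.get("sampleSize"), data.get("sampleSpace"))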
To boot, 9 times out of 10 what you're interested in as an analyst is not the absolute numbers anyway, but ratios and trends. So whether or not e.g. every precious little pageview gets counted is irrelevant, as long as the way it is counted is stable over time.
This is endemic to so many discussions about big data: "if we don't have every individual data point ever, all hell will break loose and we lose the ability to make any sense of the data." Take an intro to basic probability and statistics, will ya.
There are legitimate problems with using Google Analytics for a startup, but they're mostly related to the fact that it doesn't provide good tooling around A/B testing, customer lifecycle management and custom metrics – they're possible, but you're not making it easy on yourself. These things are the bread and butter of SaaS and app analytics (as opposed to ecommerce or media), so it makes sense to invest in something like Mixpanel/Heap/Keen/KISSMetrics. But those are issues with Google Analytics as a product, not with the quality of its data.
Exporting visit trends over 5 landing pages and a month? Sure.
Exporting page views for 100,000 products, each of which got 5-100 views? Then that 5% sample is going to exclude most products. But the latter is exactly what's necessary if you're trying to determine how each product category is really performing.
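A quick back-of-the-envelope simulation of that effect in Python (assuming, for simplicity, a uniform 5% sample over individual pageviews; GA actually samples at the session level, so treat this as illustrative only):

    import random

    random.seed(0)

    # Hypothetical long-tail catalogue: 10,000 products with 5-100 views each.
    views = {p: random.randint(5, 100) for p in range(10_000)}

    # Uniform 5% sample over individual pageviews.
    sampled = {p: sum(random.random() < 0.05 for _ in range(n))
               for p, n in views.items()}

    missed = sum(1 for n in sampled.values() if n == 0)
    print(f"products with zero sampled views: {missed} of {len(views)}")
    # A product with only 5 views escapes the sample ~77% of the time
    # (0.95**5 ~= 0.77), even though the aggregate totals still look plausible.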
Two alternatives I prefer to Google Analytics Premium (once you get to that size): Webtrekk, a small but competent German company whose product costs around 1/10th as much per year, has a fraction of the bugs, and does reliable unsampled daily dumps (moving to hourly, I believe), although the UI is a little less intuitive; and a self-hosted Piwik instance, so you don't need to worry about data exports. The truth is modern relational databases are incredibly powerful and will easily scale even with information like impressions in onsite search. There are multi-TB instances of Postgres out there. I really suggest installing either in parallel to GA or on their own when you set up tracking.
I do agree with you that anybody involved in any kind of job that includes "analytics" in the title, or indeed most people in management, should take an intro stats course. I particularly like Introduction to Statistical Learning because of its brevity, relatively high abstraction level, and lack of maths.
I agree with this in principle, but honestly the sampled numbers can be really misleading. It all depends on the magnitude of your visits. If you ask for a year's worth of data for a company with 10 million visits for the year, you're going to get into trouble. This makes creating tools and automated workflows that use the GA API pretty hard.
Sampling itself makes a ton of sense though, and it's why Google can offer this service at the level of quality they do.
Generally yes, sampling works well and is robust even for surprisingly small samples.
For stratified analyses -- particularly if you want to stratify traffic by income potential -- sampling over all visits may mask the data of interest.
As for your example, a 10k and a 100k sample will differ by a factor of 3.16 in the standard error (that is: the accuracy of estimates of central tendency). In other words, 10x more datapoints buy you only about 3x greater accuracy.
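In symbols, since the standard error of a mean shrinks with the square root of the sample size:

    \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \qquad
    \frac{\mathrm{SE}_{10\mathrm{k}}}{\mathrm{SE}_{100\mathrm{k}}}
      = \sqrt{\frac{100000}{10000}} = \sqrt{10} \approx 3.16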
Google Analytics is really designed for, and works well for, _websites_. If your startup is a website, as opposed to an app or service which happens to be on the web, then it's a good option.
For our web app, we use Mixpanel, along with tracking events into our own database. This allows you to track custom events for the things that matter in your app - think 'someone added something to the cart' or 'someone clicked the reply button', not 'someone visited this page'.
Yes, Google Analytics does let you track custom events, but it's extremely limited compared to Mixpanel, which lets you attach a properties object with as many custom properties as you like to each event, and then do retroactive analysis on them instantly.
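Roughly what that looks like with the official mixpanel Python library (pip install mixpanel); the token and property names here are invented for illustration:

    from mixpanel import Mixpanel

    mp = Mixpanel("YOUR_PROJECT_TOKEN")  # placeholder project token

    # Each event carries an arbitrary properties dict that you can later
    # segment and analyse retroactively in the Mixpanel UI.
    mp.track("user_123", "Added to Cart", {
        "product_id": "sku-42",
        "price": 19.99,
        "cart_size": 3,
        "campaign": "spring-sale",
    })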
> It’s a bit like how Gallup can summarise Indonesians’ smartphone habits by calling 1,500 of them; it works fine if you’re looking for a general pattern, but it might skew the data if you’re looking for data about a tiny niche of smartphone users or if Gallup happened to call up relatively too many Nokia users that day.
This is an incorrect statement and interpretation of how statistics work. Is there a chance that the 1,500 Indonesians they call that day are not representative of the overall population? Of course, but the probability of that is very low. This concept specifically is known as statistical significance[0]. The conclusion is: sampling error can lead to incorrect conclusions, but if you can eliminate biases in your sampling, then a sample can indeed be representative. Personally, the more important takeaway is this: before you start deriving conclusions from your metrics, it's necessary to fully grok basic statistics.
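For scale, the worst-case margin of error at 95% confidence for a simple random sample of n = 1500 is

    \mathrm{MOE} = z\sqrt{\frac{p(1-p)}{n}} \le \frac{1.96 \times 0.5}{\sqrt{1500}} \approx 0.025

i.e. about +/-2.5 percentage points, provided the sample is unbiased.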
Completely agree. Sampling is fine; but you need to understand that the free version of Google Analytics isn't a replacement for a proper BI tool. GA also doesn't even do sampling until your traffic is over a certain threshold, so at low traffic levels it's actually accurate (though those levels are also low enough that you probably shouldn't try to draw too many insights).
This is not to say that GA is bad; in fact I would call you stupid if your web startup went out and bought a tool before you launched. GA is free, it works well, it's super-easy to implement, and people should use it; but once you've reached the point where you can afford something better (usually in addition to GA, not in replacement of) you should look into doing so.
I think Google Analytics is much better than nothing, and until you reach a certain scale (I used the estimate of ~1 million monthly sessions) I think it's sufficient.
What I am mainly arguing is that relying completely on GA for reports and performance measurements is dangerous and frustrating.
I have various low traffic websites, and just because I hated having to go through the burden of creating accounts for all of them in Google Analytics, I wrote a very simple web analytics engine called Microanalytics[1].
It is a Couchapp (which means it only takes a CouchDB database to work, no other server or backend) and it allows emitting custom events with a simple `ma(event, [optional_value])`.
Every event is tied to a session, so later you can analyse and filter events based on session, see exactly who did what on your site, see if the same user came back at some other day, things like this.
So, on my small websites I can clearly see when a person enters the site and all that. Also, 1 visitor makes a difference, and when I tested running Google Analytics alongside Microanalytics, Google Analytics showed a lot more visitors than Microanalytics. I know Microanalytics can't be wrong, because it literally counts me in real time when I enter the site, so I don't know what to think. The only thing it doesn't count is visitors without Javascript, but are there really so many of them? Does Google Analytics count them? I think not.
---
Also good to say: the way Microanalytics does data visualization is through a command line tool that prints to STDOUT, so you can do all sorts of things with Unix pipes. For example, to run an A/B test once, I just called `ma('version', versionName)` in each tested page, `ma('conversion', 'converted')` when appropriate, and later ran the following:
for name in versionA versionB
    echo $name
    # sessions that saw this version
    set v (microanalytics identifier inspect sessions --limit 300 | grep $name | wc -l)
    # of those, sessions that also converted
    set c (microanalytics identifier inspect sessions --limit 300 | grep $name | grep converted | wc -l)
    # visitors, conversions, and the conversion rate
    echo $v $c (echo "$c / $v" | bc -l)
    echo
end
(This example is in the fish shell, but you can do the same in bash, obviously.)
You'd have to dig into the details (referer, user agent, etc.) on the Google Analytics side to see the differences... GA probably tracks every random web scraping (search engine) hit by monitoring the loading of the JavaScript file.
What drives me nuts about Google Analytics is their foot-dragging on mobile.
Despite the mobile version of Google Analytics, trying to do anything meaningful, like analysing app retention, is a huge pain.
All the newer players in mobile analytics, like Flurry or Localytics, have this stuff nailed, but Google Analytics leaves you with outdated, web-oriented reports like "New vs. Returning" and "Loyalty" and generally prefers to push you towards looking at sessions instead of actually understanding what your users are doing.
I haven't explored the current functionality for mobile apps, but we're currently working on multi-channel funnel analysis and it's a complete nightmare given that most purchase journeys happen across more than one device. And that's with GA Premium...
Collecting data from YOUR website - costs time/money.
Google Analytics is FREE and low cost time/money to install.
It's true the data is there and we can get to it... but this takes some foresight that having installed analytics does not require. E.g. a few months after the original Apache logs have rotated, you realize it would be nice to know how many Mobile Safari users are coming to your site vs. Chrome-on-Android users, because you're trying to determine the impact of releasing that next feature that requires WebRTC. You can add the analysis to future traffic and give it a few extra weeks, but wouldn't it be nice to know right now?
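To be fair, if you do still have the raw logs, that particular analysis is only a few lines of scripting. A minimal sketch in Python, assuming combined log format with the user agent as the last quoted field (the browser checks are deliberately crude), though it only covers whatever logs survived rotation:

    import re
    from collections import Counter

    ua_pattern = re.compile(r'"([^"]*)"$')  # last quoted field = user agent

    counts = Counter()
    with open("/var/log/nginx/access.log") as f:
        for line in f:
            m = ua_pattern.search(line.strip())
            if not m:
                continue
            ua = m.group(1)
            if "Android" in ua and "Chrome" in ua:
                counts["chrome_android"] += 1
            elif "iPhone" in ua and "Safari" in ua and "CriOS" not in ua:
                counts["mobile_safari"] += 1  # exclude Chrome on iOS

    print(counts)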
GA is nice because it lets you avoid the investment in time/money and you can still go back and look at historical data points you might not have imagined being useful before...
I think in this way sampling is really not a big deal. If it is, then you're right - absolutely right: collect the data points that require that level of accuracy yourself. But if you need that level of accuracy early on, I don't think you're focusing on the right things...
Actually, copy/pasting javascript would be MORE work.
With the hosting company I use, all I have to do is go to "http://mydomain.com/stats" and enter the username and password I set up. No javascript changes necessary, because everything is in the apache or nginx access logs, which have been around just about forever.
Kinda sad/funny that younger web devs don't know this stuff.
Oh wow, yeah, reading files in /var/log/httpd/ is soooo hard.
I agree that not all people have the capability to work with this (read: front-end developers, Wordpress customizers, etc.). Still, even cPanel lets you download the logs from past months.
cPanel has multiple log analyzers built in. So the user doesn't have to go through that work. If your site is on a shared host and it uses cPanel, just log into cPanel and you'll find them.
> Collecting data from YOUR website - costs time/money. Google Analytics is FREE and low cost time/money to install
The thrust of the article is that it would appear to be just free enough to get you hooked, but making it give accurate, useful results is very, very expensive.
No doubt Piwik is great, but again it's not as easy to set up or maintain as Google Analytics. GA is easy and free, but there's a cost in that you might get sampled data if you're doing a high volume of traffic. OK, but there is also a cost in self-hosting when you're doing a high volume of traffic. No free lunch applies.
I just wrote in another comment[1] about Microanalytics[2]. It is an easy Javascript option that gives you a lot of cool stuff. If you're not a powerful corporation it is definitely worth a try.
1. If you use a CDN or other sort of caching service, a significant fraction of traffic never hits your server, and your logs are incomplete.
2. Logfile parsing does not measure on-page events that do not involve a server call (such as certain UI interactions, and clicking exit links out to other websites).
3. Parsing logfiles gives crap data.
There's a ton of filtering and heuristics built into modern web analytics tools. It's a back-breaking task to re-implement them, and your data is literally less than useless until you're at least part-way through the list (as in, it provides negative value because it drives you to incorrect conclusions).
For example: web crawlers. Google, Bing, Ask, Baidu, Yandex, and a dozen smaller or region-specific search engines all generate automated traffic to your site. Some of their IPs are public, but Google intentionally runs anonymous web crawlers to make sure people aren't sending Google different pages than what they send humans. And those are just the legitimate robots. By some estimates, as much as 50% of modern web traffic is robots. The majority of these are spam bots of one form or another. You know, the ones that troll forums with ad links or malware? Those guys aren't negligible.
Another example: Associating multiple visit sessions from the same person. This is a fuzzy statistical decision based on cookies, IP address, user agent string, device type, and other data. It's complicated, with a lot of edge cases. Modern tools aren't 100% accurate, but they've already seen and accounted for a decade of exceptions.
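A toy illustration of the kind of fallback heuristic involved (entirely hypothetical; real tools weigh many more signals and edge cases):

    import hashlib

    def visitor_key(cookie_id, ip, user_agent):
        # Prefer a first-party cookie id; otherwise fall back to a fuzzy
        # IP + user agent fingerprint, which breaks on shared/variable IPs.
        if cookie_id:
            return "cookie:" + cookie_id
        raw = (ip + "|" + user_agent).encode("utf-8")
        return "fp:" + hashlib.sha1(raw).hexdigest()[:16]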
Another example: Cache-busting. Networks and browsers will cache server calls (including beacon-based tracking), no matter how much you explicitly tell them not to. Unless you specifically write scripting to add random tokens to each of your server calls, you will actually be working with data that has been sampled to an unknown extent. And unlike GA's sampling, this sampling is page-based rather than visitor-based, which means that visit sessions will have holes in them.
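The countermeasure is conceptually simple but easy to forget. A sketch of the idea (endpoint and parameter names hypothetical, though GA's own beacons carry a similar random z parameter):

    import random
    import time
    from urllib.parse import urlencode

    def beacon_url(event, base="https://example.com/track.gif"):
        params = {
            "e": event,
            "t": int(time.time() * 1000),  # millisecond timestamp
            "z": random.getrandbits(32),   # random cache-buster
        }
        return base + "?" + urlencode(params)

    # Unique URL on every call, so intermediaries can't serve a cached copy.
    print(beacon_url("pageview"))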
In my work as a web analyst, I worked with one company that was moving off their own home-rolled system based on Apache log files. They were even putting information from this system into their quarterly earnings report. Once we set them up with a "real" system, we found that their homebrew system had overstated traffic by a factor of ten. Some combination of bot traffic, users on consumer broadband having variable IP addresses, and cross-domain cookie tracking meant that the web aspect of their business had only 10% of the reach they previously thought.
THAT SAID... rolling your own home-brew system is sometimes the correct answer. I'm working with two such clients right now, and they are making the decision that is right for them. It's not the right decision for the majority of companies--it's a large engineering effort, and it needs to have a significant expected payout.
Oh, I agree that httpd log data does not contain all information. But neither does JS-based analytics data (for example crawler data, even though the Google crawler runs some JS; there are also some issues with https and GA).
I just think it's not productive to complain about info that GA is "hiding" when it's most likely in your logs somehow.
You can also add your own js snippet to your pages to help you track some aspects.
I'd definitely suggest your two companies looking to home-brew check out Snowplow as well (https://github.com/snowplow/snowplow). If there's something they would need that isn't available out of the box yet in Snowplow let me know - details in profile!
A quick spin through Segment.com's list of integrations shows a TON of GA-like services. Anyone have any suggestions for an alternative to GA that provides the same type of data without the sampling / bloat / etc. of GA?
Hey Noah, an unfortunate side effect of GA's power is its steeper learning curve. A new driver sitting in a Ferrari might consider all the paddle shifters and gauges to be "bloat," but in the right hands the car is a beast ...
Don't get me wrong, I do wish Google spent more time improving GA, but I think many of the alternatives sacrifice too much just to be beginner-friendly. What we need is a Tesla for web analytics.
Agreed - I've used GA for many years and it's indeed extremely powerful (especially for a free tool). And the tradeoff is fair, I think -- Google gives us a ton of power to analyze site data in exchange for, well, Google being able to analyze site data :) But it does take a bit of a kitchen-sink approach to things.
I guess I should rephrase my question to be less broad -- of the companies that purport to be competitive, do any of them do a particularly good job (even if just at subsets of GA's features)?
Having dealt with lots of customers who take the stats generated by GA as gospel despite multiple problems, I disagree.
Some example problems: referrals unreliable, countries unreliable, sampling distorting figures, no warnings when data displayed is based on very little data, sessions often misinterpreted as clicks by users, inexplicable disparities with other methods of tracking because their methods are pretty opaque and because of sampling, in-page analytics looking deceptively like click-tracking when in most cases it uses page load data. Some of that you can attribute to user error, but it is not good for the market that google dominates tracking like this, and GA is sometimes misleading.
Just as an example of a problem I ran into recently - the free GA doesn't offer referral stats for https websites, but this isn't made clear to end users. As a result they simply trust that referrals have collapsed if a referring site switches to https.
The https referral issue isn't unique to GA, nor is it their fault. The browser doesn't pass a referrer value for HTTPS->HTTP traffic. Best way around this is to use HTTPS yourself, or use custom UTM tags on your link (if you have any say).
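For example (values invented), a campaign-tagged link keeps its attribution even when the referrer header is stripped, because it rides along in the URL itself:

    from urllib.parse import urlencode

    # utm_* values are whatever taxonomy you agree on with the linking site.
    link = "http://example.com/landing?" + urlencode({
        "utm_source": "partner-site",
        "utm_medium": "referral",
        "utm_campaign": "spring-launch",
    })
    print(link)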
GA gives prominence to referrals in a world where more and more sites use https. People are making real business decisions based on this flawed stat, and most people have no idea that this is in fact a useless stat without https if any of your referrers are secure pages. Google says website x was your top referrer, and they just take that as truth.
I think they should just take it out and recommend using campaign links or landing pages, or at the very least make it very clear that this is only a partial and distorted view. Same goes for in-page analytics etc.: the presentation of overlays on page items (implying clicks) is misleading.
I don't think OP is saying GA sucks; I think he's saying that you can't take the results that GA gives you as being to-the-numbers accurate. They're still directionally accurate, and GA is so easy to implement that it gives you a lot of bang for relatively little effort.
Many people actually don't realize that GA does sampling, or the effect said sampling can have on their numbers.
Exactly. Knowing its limitations, and being able to communicate them well to the people you need to convince to succeed in your job, is the right way to go.
Please don't sign comments; they're already signed with your username. If other users want to learn more about you, they can click on it to see your profile.
I agree with your summary but don't agree that it's click bait. I read it as the author's warning that you should RTFM when deploying GA, which I doubt most people do.
In a perfect world, where everyone understands every single data point completely or you don't have to work with anyone else, that's completely true. That kind of thinking is however extremely limiting for 9/10 organisations out there, because teamwork and team buy-in are so important to how decisions are made. Believing that GA can do anything and everything, which is the belief of more managers than you'd like to think, is exactly the underlying issue.
The real issue I find with that is that startups with this kind of thinking tend to put numbers before people or actual results.
And most of the time, you cannot reduce a problem or its solution to number crunching.
Once that breach is opened, it becomes a nightmare explaining to your coworkers that a small drop in sessions probably does not call for a complete restructuring of the website.
I'm currently working in this kind of startup, and doing a good job is actually impossible, as we change everything every 3 months or so, because the numbers in analytics changed.
Yeah that's exactly the kind of frustrations that I've experienced countless times and hope to get better at combatting by exploring how GA really works and how/where it's limited. Slowly getting there..
I loved the article. I think it tracks very much in line with the kind of experiences that I have personally had in working with Google Analytics data -- both via their reporting API and in the dashboard itself. I think GA is a great service though, when the caveats are considered and understood. It's great that you took the opportunity to state and consolidate this information in one place.
I also just want to shout out that I work for a startup called Narrative Science that offers a free product called Quill Engage (https://quillengage.narrativescience.com/) that can help identify some of the key insights from your GA data in a free weekly and monthly automated report.
I agree with most of the points. We use GA for general traffic patterns, definitely not conversion tracking. It is horrible at true conversion tracking.
What kind of conversions are you trying to track? Although it takes some configuration, I've had no problems tracking all kinds of conversions for different sites--ecommerce, SaaS, leadgen, etc.
The worst thing about GA for conversion tracking is that it's not retroactive, so you have to go through a slow configure / wait a few days / see if the data looks right / reconfigure cycle.