Programmers’ Build Errors: A Case Study at Google [pdf] (googleusercontent.com)
95 points by Fr3dd1 on June 18, 2014 | 23 comments


I recently left Google, and its build system is the one tool I miss the most. No other available tool outside of Google even comes close.


Heh. I also felt that way, and spent the first month after leaving looking for the next closest thing.

The most exciting thing I found was Nix <https://nixos.org/nix/>, a hermetic build system + Linux distro + config-based service manager, all built around a purely functional language.

The way it's actually used to build a lot of existing projects is coarse-grained — the Nix build actions might be simply "get tarball; untar; configure; make". (I think it may be combined with distcc/ccache for per-object caching, not sure).

But if they run existing tools to do the actual builds, how do they enforce hermetic dependencies? They force a specific absolute path structure where all input and output dirs include the (transitive) hashes. [There is also some crazy rewriting of hashes inside binaries, which they claim works fine.]

The drawback is you can't move binaries or libs around; the benefit is multiple versions can coexist in-place.
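
The rough idea, in a made-up Python sketch (the hash scheme and paths here are just illustrative, real Nix is considerably more involved): the output directory name is a function of everything that can influence the build, including the dependencies' own hashed paths.

    import hashlib

    def store_path(name, build_script, dep_paths):
        # Hash everything that can affect the result; dep paths already
        # embed their own hashes, so the digest is effectively transitive.
        h = hashlib.sha256()
        h.update(build_script.encode())
        for dep in sorted(dep_paths):
            h.update(dep.encode())
        return "/nix/store/%s-%s" % (h.hexdigest()[:32], name)

    zlib = store_path("zlib-1.2.8", "tar xf zlib.tar; ./configure; make", [])
    curl = store_path("curl-7.37", "tar xf curl.tar; ./configure; make", [zlib])
    # Changing zlib's inputs changes its path, which changes curl's path too,
    # so multiple versions coexist in-place but binaries can't be relocated.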

I was planning to experiment with some pieces I was missing:
- Use bup <https://github.com/bup/bup> to chunk build products for incremental transmission of similar builds.
- Their build farm IIUC assumes long-running slaves; I wanted to build on a public cloud, paying only for burst activity.

Then I suddenly remembered that I write python and now javascript, which I can use without any kind of build, and gave it up :-)

But if I were to toy with hermetic building now, I'd look at basing it on Docker.


Can you explain why? I pretty much hate all the ones I've worked with; some insight into a system that people actually enjoy using would be really valuable.


Yes please. After spending way too much time with Ant / Maven / Gradle, I'm convinced they all got it wrong. A build system that doesn't suck is like a unicorn - rumored to exist, but nobody's seen one yet (except Leiningen; people seem to have no complaints about that one).


From what I understand, Leiningen is Maven in Lisp clothing. Any complaints you have with Maven might apply to Lein as well, unless it is just the XML you don't like. I have only used it a little, but so far it seems nice.


Google has publicly described their build system here http://google-engtools.blogspot.com/2011/08/build-in-cloud-h... (low fidelity) and here https://www.youtube.com/watch?v=2qv3fcXW1mg (high fidelity).


Sorry for the late reply. I got distracted by life :-)

So the qualities that make Google's build infrastructure awesome are:

1. Truly incremental builds. By this I mean the build system can reliably detect when something has actually changed and exactly how much it has to rebuild, and it then rebuilds only as much as it has to. Some systems claim to do this, but almost none that I'm aware of, other than the aforementioned Nix package manager, actually do.

1a. Google's system does this by forcing all targets to fully specify their inputs and outputs, and by treating all targets as pure functions. No side effects allowed. The inputs determine the outputs, so if nothing changed in the inputs you know the outputs didn't change either. This means all compiles are as close to fully hermetic as you can get.
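
Something like this toy sketch (not Google's actual code, just the shape of the idea, with made-up target fields):

    import hashlib, pathlib

    def input_digest(src_paths, dep_digests, toolchain_id):
        # Every declared input feeds the digest; the toolchain counts as an input too.
        h = hashlib.sha256(toolchain_id.encode())
        for p in sorted(src_paths):
            h.update(pathlib.Path(p).read_bytes())
        for d in sorted(dep_digests):
            h.update(d.encode())
        return h.hexdigest()

    def needs_rebuild(target, previous_digest):
        new = input_digest(target["srcs"], target["dep_digests"], target["toolchain"])
        return new != previous_digest   # same inputs => same outputs, so skip the work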

2. Distributed builds work really well, building off the aforementioned guarantees for build targets. You can also construct a very accurate graph of your build targets and effectively distribute all of the building.
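
In toy form (the real system dispatches compile actions to a build farm; here it's just a topological walk over a hypothetical target graph):

    from graphlib import TopologicalSorter  # Python 3.9+

    graph = {"app": {"libui", "libnet"},
             "libui": {"libbase"},
             "libnet": {"libbase"},
             "libbase": set()}

    ts = TopologicalSorter(graph)
    ts.prepare()
    while ts.is_active():
        for target in ts.get_ready():   # independent targets, safe to build concurrently
            print("dispatching", target, "to a remote worker")
            ts.done(target)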

3. Build artifact caching. Since your targets are pure functions, you can safely cache their outputs for the whole company. When you have thousands of engineers all building the same codebase frequently, the builds are highly distributed, and the cache is shared by everyone, the chance that someone else has already compiled a large fraction of the code you are about to compile is high.
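
Conceptually the cache is just a map from input digest to artifact, shared by everyone. A toy sketch (the real thing is of course a remote, distributed service):

    cache = {}  # stand-in for the shared, company-wide cache

    def build_or_fetch(target, digest, compile_fn):
        if digest in cache:
            return cache[digest]        # someone else already built this exact thing
        artifact = compile_fn(target)   # cache miss: build it (locally or on the farm)
        cache[digest] = artifact
        return artifact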

All of this combines to make builds really fast, really reliable, and really repeatable, which is kind of the holy grail of build systems when you think about it.

Google happens to have this because they care enough about their developers' productivity that they invested thousands of man-hours fixing the time sink that is compiling. It also helps that they build everything from head, so they ran into productivity losses from build times really quickly.


I'd add that having the whole company use a single language to describe builds and tests is a simple but crucial thing. It allows automatically building and testing everything affected by your change, even if you've never heard of those dependencies.

Oh, and "fully specifying inputs" includes the compilers involved, and even the build tool itself! You can reproduce old builds using the same tools that were used then. Upgrading any tool triggers the appropriate recompilations.


Not Google's method, but I am very happy with it: Eclipse's automatic incremental builds, with libraries checked in to source control directly.


Heh. In my last job we used Gentoo as the build system for an embedded system. It was a joy to use. In my current job we use naked Makefiles chained together. I waste somewhere near 50% of the day building things I don't need to build, rebuilding things because the last time they were built it was with the wrong options, fixing build bugs where modifications didn't make it into the final binary, etc. etc. etc. It makes me weep.

But seriously, Gentoo was the bomb. Your package doesn't impact anyone else until its .ebuild is updated. You can precompile most units of the system once a day and download them as binaries. For 4 years of my life I didn't have to worry about how things were being built, except for my own very small part that I had mastered. Bliss.


Can you describe in more detail how it worked? I would assume, since it's based on BSD ports, that one is simply using a common Makefile library and then building fairly independent components.

The problem presumably starts when decomposing a monolithic build: how do you pull it all together again?

I have stalled at pyholodeck.mikadosoftware.com and need some inspiration to pick that over the 100 other things I need to do :-)


So why don't you move your current system to Gentoo? You know someone has to start it, right?


It's a big system, and not something that you can do piecemeal. You basically need management buy-in. So I've been busy recruiting support from other developers and lower-level managers, writing memos that describe the problem to the higher-ups and so on.


Findings I found interesting:

1. No correlation between build frequency and build failure ratio.
2. No correlation between build time and build frequency.
3. No correlation between experience and build failure ratio.


For 3, you have to look at the definition of an experienced versus a novice developer used in this study. I think the definition maybe fits for Google, but not in general use.


Agreed. The definition of "experienced" seemed dubious. Experience != skill, as well.

One of the things in the paper was that there was a population with middling error rates that did "many" builds, and a bifurcation into two populations that did "few" builds - one with few errors, and one with many errors. (pg 9, section 4.1, "How often do builds fail?")

Could it be that there is a group of gurus who don't need to constantly build, and who batch up testing of larger feature sets, and another group of perpetual freshmen who spend so much time fixing their fault-riddled crap that they simply cannot submit very many builds? Yes, I know they attributed at least some of this to the employee's role, but it makes me wonder. "Experience" would not tease this pattern out. (While it seems to me customary to build more often in the initial stages of using a new language, we've all worked with "that guy" who just bumbles through, year after year.)


I think Mozilla's Rust team was considering telemetry reports, not just for rustc compiler crashes, but also for syntax errors in user code. Knowing what kind of syntax errors are most common in user code could provide feedback for language design.


That would be really cool, especially if lots of OSS projects were to enable such reporting.

Rather than building it into the Rust tooling, it'd be cool if they built a collection service and then provided hooks into things like Jenkins / Travis CI that are being used by OSS projects. That way you could gather data regardless of the language used.
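
Even a post-build hook as simple as this sketch would do it (the endpoint and the error-matching regex here are entirely made up for illustration):

    import json, re, sys, urllib.request

    ENDPOINT = "https://errors.example.org/report"  # hypothetical collection service

    def report(log_text, project, language):
        # Scrape error lines out of the compiler output and ship the categories off.
        errors = re.findall(r"error(?:\[\w+\])?: .*", log_text)
        payload = {"project": project, "language": language, "errors": errors}
        req = urllib.request.Request(ENDPOINT,
                                     data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    if __name__ == "__main__":
        report(sys.stdin.read(), project="my-oss-project", language="rust")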


I have been mulling over a research MSc looking at the metrics available from OSS projects and how they might feed into better software practices (I think we can improve somewhere).

This stuff kind of inspires me - but also makes me quake a little in my boots.


Link failure, anyone have a copy?


Just tested it, the links work fine.


So Java > C++ ?


C++ is a pretty easy target to pick on, though. I'm convinced that it is pure machismo that makes anybody use it, rather than a desire for productivity.

Of note on pg 9 is the "resolution time" graph. Look at the time to fix "missing right parentheses". While the median time to fix this C++ error is about 6 or 7 minutes, about 25% of the time it takes so long that the (75th percentile) edge of the box is off the chart - over 50 minutes. (Disclaimer: it doesn't say how often this error happens.)

What kind of masochist writes such a giant run-on expression that it takes an hour to figure out where to match up the parentheses??? (and doesn't know how to use the parentheses matcher in the editor)

Perhaps there is an untold story for that error, or perhaps C++ just attracts a lot of people who like a challenge.

...

Turbo Pascal (5.5 and above) > C++ as well, but Java was free, and ran on Unix.



