Not to take away from the substance of the article itself, but is anyone else surprised that they have 2 billion "documents", which presumably means active ads/listings? That seems like an awful lot.
MongoDB is being used for historical archiving, not for the live site itself. The big reason is that changing table schemas for very large sets of old data is painful with MySQL. So the 2 billion number covers every ad/listing older than a set amount of time.
The live data is < 1 TB and is still stored in MySQL.
The "set amount of time" typically hovers around 60 days, though our archiving process has been off for several months while the migration took place. So we have some catching up to do: somewhere in the neighborhood of 150 million postings, last I counted.
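For anyone trying to picture the live/archive split described above, here is a rough sketch in Python. It is not craigslist's actual code: the `split_postings` function, the `posted_at` field, and the sample postings are all made up for illustration, and the 60-day cutoff is just the approximate figure mentioned above.

```python
from datetime import datetime, timedelta

# Assumption: "around 60 days" taken from the comment above, not an exact rule.
ARCHIVE_AFTER = timedelta(days=60)

def split_postings(postings, now):
    """Return (live, archive) lists based on each posting's 'posted_at' time."""
    cutoff = now - ARCHIVE_AFTER
    live = [p for p in postings if p["posted_at"] >= cutoff]
    archive = [p for p in postings if p["posted_at"] < cutoff]
    return live, archive

# Hypothetical sample data. The second posting carries an extra "price" field,
# which a document store can hold without altering any table schema.
now = datetime(2011, 6, 1)
postings = [
    {"id": 1, "posted_at": datetime(2011, 5, 20), "title": "bike for sale"},
    {"id": 2, "posted_at": datetime(2011, 1, 15), "title": "couch", "price": 40},
]
live, archive = split_postings(postings, now)
```

The point of the mixed-shape sample is only that old records with different fields can sit side by side in the archive, which is the schema pain the archive store is meant to avoid.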
I've been hearing some good things about Riak lately and their masterless implementation seems quite interesting. Did Riak ever make your radar and, if so, what were the disadvantages that made you choose MongoDB?
If I had to guess based on the video, I'd say the lack of a Perl client, and that you'd probably end up having to roll too many of your own solutions on top of it?
~2.2 billion is a ton of listings. What you have to realize is that craigslist wasn't in hundreds of cities on day #1. In recent years, we've had tens of millions of "live" ads on the site, but it took a bit of time to grow to that size.
This looks like data warehousing of the archive. The two billion listings probably represent all expired ads ever. There is no way they have 2 billion active ads at any one time.