This was very surprising to me as well. Assuming IDs are 8 bytes (they appear to be) and that the mean post is 250 bytes (which seems conservative to me), the mean fanout works out to something like 1600. That's big.
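(For anyone unfamiliar with the pattern under discussion, here is a minimal sketch of fanout-on-write with an inbox of post IDs; the names and numbers are illustrative, not Tumblr's actual schema.)

    import scala.collection.mutable

    // Fanout-on-write sketch: each follower's inbox stores only the 8-byte
    // post ID; the ~250-byte post body is stored once in a separate map.
    object FanoutSketch {
      val posts   = mutable.Map.empty[Long, String]         // postId -> body (stored once)
      val follows = mutable.Map.empty[String, Seq[String]]  // author -> followers
      val inboxes = mutable.Map.empty[String, List[Long]]   // user -> post IDs, newest first

      def publish(author: String, postId: Long, body: String): Unit = {
        posts(postId) = body
        // One 8-byte ID per follower: at a mean fanout of ~1600 that is
        // ~12.8 KB of IDs per post, versus ~400 KB if full posts were copied.
        for (f <- follows.getOrElse(author, Seq.empty))
          inboxes(f) = postId :: inboxes.getOrElse(f, Nil)
      }

      def dashboard(user: String, n: Int): Seq[String] =
        inboxes.getOrElse(user, Nil).take(n).flatMap(id => posts.get(id).toList)
    }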
It seems that the only way to scale a site like this is to build decoupled services, maybe using a message queue (AMQP, etc.).
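A minimal sketch of that kind of decoupling, using the RabbitMQ Java client from Scala (the queue name and message format are made up for illustration): the web tier publishes an event and returns immediately, while a separate worker consumes the queue at its own pace.

    import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

    object QueueSketch {
      def main(args: Array[String]): Unit = {
        val factory = new ConnectionFactory()
        factory.setHost("localhost")
        val channel = factory.newConnection().createChannel()

        // Durable queue shared by the web tier (producer) and background workers.
        channel.queueDeclare("post-created", true, false, false, null)

        // Web tier: publish the event and move on; the slow work happens elsewhere.
        channel.basicPublish("", "post-created", null, "postId=12345".getBytes("UTF-8"))

        // Worker: consume events and do the expensive work asynchronously.
        channel.basicConsume("post-created", true, new DefaultConsumer(channel) {
          override def handleDelivery(tag: String, env: Envelope,
                                      props: AMQP.BasicProperties, body: Array[Byte]): Unit =
            println("processing " + new String(body, "UTF-8"))
        })
      }
    }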
How can I learn more about this kind of system? Also, are there any open-source applications built in a distributed way that I could learn from?
Exactly. If you don't know what Thrift is, check it out.
In addition to solving the problem with simple, focused applications (in the classical Unix style), it solves other practical issues, like PHP having shitty clients for most modern data stores (Cassandra, Redis, hell, even Memcached) compared to what you get in, say, Python or a JVM language.
I don't really know of any. Nobody writes this sort of distributed application from scratch. They're almost always evolved from monolithic database-backed applications, and the process of evolution is half development, half ops. I doubt that Tumblr as it exists today could be deployed from scratch, even if you had all of the source code of all of the systems.
Gearman, briefly mentioned in the article, can probably help you as well. It lets you do cross-language, cross-platform (a)synchronous function calls. Looks cool; never used it.
I dunno, it seems they might do well to hire based on someone's ability to survive a technological gauntlet. But the word "useless" makes the statement kind of... useless.
It looks like they've got at least three different programming ecosystems thrown together. Admittedly some of it may be legacy, but still, most large environments I've had experience with have taken a somewhat more focused approach. The more moving parts you have, the more things you have to head-scratch at when something breaks and you get a 3am page. ;)
Yeah, I could see a personal project using a good chunk of those (Apache, nginx, Memcached, MySQL, Varnish, PHP). It's definitely not outrageous to think Tumblr uses them all.
This seems like a useless statement. Surely no one is consciously failing people they know fit the team and can perform the job. The whole point of interviews is noisily assessing these things. Point me at a reliable, cost-efficient way of determining that and I'll gladly champion it.
Perhaps their point is that HR and recruitment agencies do just filter candidates by "10 years of Redis experience", and most good candidates never get noticed.
I don't have C# on my resume. I wouldn't shrink from working on a C# codebase, though. But would I get an interview? Unlikely.
Agreed, and to me it sounds incredibly short-sighted. People who have survived technological gauntlets know how to adapt and help instill a sense of continuity in fellow team members.
I, for the life of me, don't understand why people defend these gauntlets. Wasn't getting a degree from MIT enough of a gauntlet? Wasn't writing reams and reams of working, robust, well-documented code proof enough of one's software engineering skills?
That's what college is advertised to do: show that you can put up with tons of work and stick it out to see the end you wish to achieve even if the means are not what you had hoped.
Try interviewing: every 2 or 3 years the interviewing crowd is influenced by that day's generational thinking (not a bad thing, just something to be aware of). I, for one, always welcomed people who had passed through amazing experiences; from recent interviewing, I was/am apparently in the minority. Oh, and I meant the gauntlet of experience, learning under fire; I'm not a college graduate.
I got the impression from the article that they were referring to passing through a technological gauntlet during the interview process (i.e. complicated programming puzzles) rather than life experience. Is that also what you are referring to?
> Initially an Actor model was used with Finagle, but that was dropped.
The Akka folks do a good job of keeping the actor model on your mind when building Scala systems, but it's obviously not always the right approach. I'd love to hear more about how they ended up abandoning the actor model.
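For context, Finagle composes services with futures rather than actors. A rough sketch of that style (plain scala.concurrent.Future here; Finagle's com.twitter.util.Future composes the same way, and the service calls are made up):

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // Futures instead of actor message passing: no mailboxes or supervision
    // trees, just asynchronous values that are mapped and joined.
    object DashboardSketch {
      // Hypothetical downstream calls.
      def fetchInbox(userId: Long): Future[Seq[Long]] = Future(Seq(1L, 2L, 3L))
      def fetchPost(postId: Long): Future[String]     = Future(s"post $postId")

      def dashboard(userId: Long): Future[Seq[String]] =
        fetchInbox(userId).flatMap { ids =>
          Future.sequence(ids.map(fetchPost))  // fan out, then join the results
        }
    }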
This may be a slightly naive question, but why list both "500 web servers" and "8 nginx"? Seeing as nginx IS a webserver, is it fulfilling a different function (serving static pages?) compared with the Apache servers?
Well, their assets are being served from a CDN. Are they counting that in their server count?
$ host assets.tumblr.com
assets.tumblr.com is an alias for assets.tumblr.com.edgesuite.net.
assets.tumblr.com.edgesuite.net is an alias for a1092.g.akamai.net.
a1092.g.akamai.net has address 69.31.106.32
a1092.g.akamai.net has address 69.31.106.50
After a fairly bad review of Tumblr's availability [1], this was something they needed to write. They had (and still have) pretty big challenges. It's not easy to scale something like that, and this article confirms it.
I'm stunned by the traffic figures, as I don't know one person who uses it (and I suspect the majority of my non-tech, Facebook-loving friends have never heard of it).
Is that anything to do with my demographic (41, M, UK based)?
Yes (demographic). Although all sorts of people use tumblr, it seems to be most popular among (predominantly American) high-school age (and a bit older) teenagers. It has settled into a middle ground between Twitter and traditional blogs for a lot of these users -- single-paragraph posts or photo posts are the norm, and re-blogging is very common/encouraged.
Tumblr is not a blogging platform like WordPress; it's much more a community, like Reddit. Yes, you can use Tumblr to host your blog, but the majority of people use the dashboard to interact with others and use their blog to share their content. Very few people will ever see username.tumblr.com; they'll see the posts via tumblr.com/dashboard. Like Twitter: very few people visit twitter.com/citricsquid, they follow me and see my tweets in their stream.
Dashboard is more like a "reader." Everybody you follow is found there. Also any tags you subscribe to are there. Interactions like replies are also found there. I'd compare it to your Facebook feed or Twitter.com stream. It's your base of operations.
Lots of read-only content, and then a decent amount of interactions and events (messages/new follows/like notifications).
The dashboard is where you primarily view all of the posts from the people you follow on Tumblr. Think of it in terms of what you view on Twitter, all of the Tweets from people you follow plus the ability to send out a new Tweet.
You do not write an article about technology by vomiting hundreds of bullet points mentioning every technology a company ever used.
It's an appallingly bad example of technical journalism.
This is not informational; it's just bullet-point-driven bikeshedding.
I understand that, hungryblank. But I don't consider myself a journalist. I'm a developer writing for developers, so they can hopefully learn something that will help them build cool stuff. To that end I use a very concise, easily digestible style with as little extraneous verbiage as possible. Nearly every point in the article could be exploded into a long-form article, and I already get in trouble for TL;DR :-)
I use Google Reader for two main reasons. One is that once I subscribe to a feed, I get access to all messages, not just new ones. The second is that I can search old messages as well.
So if I understand Tumblr's inbox model correctly, that's exactly the kind of usage pattern that isn't supported.
The UI is very similar to a Twitter dashboard (if you are familiar with that). I don't think it's possible to search for old posts in either case, and the style of the site doesn't really lend itself to that kind of use (for better or worse).
Aside from all this, can anybody imagine what their Amazon bill is like? Lucky for them, they have not-so-demanding venture capitalists to pay their bills. I honestly wonder how long they will be able to keep growing like this without making any money.
It's not that bad. 70% of views are from the Dashboard, where they don't see Disqus, and I suspect that "most blogs" is an artifact of the circles you run in. Anecdotally, none of the blogs I follow have Disqus.
Really? I didn't see that. Is this a political or a technical comment? When building highly scalable, stable systems, there isn't much except the JVM. Sure, for some corner cases Erlang might be a solution, and if reliability is less important, .NET as well.
What questions do they need to answer that require the entire graph? It seems like the most complex thing they may need to calculate is friends-of-friends, and that's easy to do even with an adjacency list in a SQL database.
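For what it's worth, here's a minimal sketch of that friends-of-friends calculation over an in-memory adjacency list (the same two-hop expansion is a single self-join if the edges live in a SQL table); the data is made up.

    object FriendsOfFriends {
      // Adjacency list: user -> users they follow.
      val follows: Map[String, Set[String]] = Map(
        "alice" -> Set("bob", "carol"),
        "bob"   -> Set("dave"),
        "carol" -> Set("dave", "erin")
      )

      // Two-hop expansion: everyone followed by someone the user follows,
      // minus the user and the people they already follow directly.
      def friendsOfFriends(user: String): Set[String] = {
        val direct = follows.getOrElse(user, Set.empty[String])
        direct.flatMap(f => follows.getOrElse(f, Set.empty[String])) -- direct - user
      }

      // friendsOfFriends("alice") == Set("dave", "erin")
    }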
It seems that in New York it was easier to hire people who had worked at scale with the JVM, since that is what the nearby banking institutions might standardise on.
Scala looks a lot more like Ruby than Java does, which makes people feel more comfortable with the language. Also, a lot of the patterns, like collections, are more similar between Scala and Ruby than between Java and Ruby. This will change with JDK 8, but for now it is true.
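As a small illustration: a Scala collection pipeline reads a lot like Ruby's select/map on an Enumerable, whereas pre-JDK 8 Java needs an explicit loop.

    object CollectionsFeel {
      val names = List("alice", "bob", "carol")

      // Scala: reads much like Ruby's names.select { ... }.map { ... }
      val shouting = names.filter(_.length > 3).map(_.toUpperCase)  // List("ALICE", "CAROL")

      // Pre-JDK 8 Java equivalent, for comparison:
      //   List<String> shouting = new ArrayList<String>();
      //   for (String n : names) {
      //     if (n.length() > 3) shouting.add(n.toUpperCase());
      //   }
    }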
I'm sorry, but I have a hard time believing that anything is harder to scale than Twitter. With an audience of 175 million users, a 140-byte tweet field makes for about 22 gigabytes of uncompressed text if every user tweets once. If everyone tweets ten times per hour, that's 220 gigabytes an hour. A one-TB HD would only be good for 5 hours or so, meaning Twitter would need to buy a good 3000 hard drives per year. That's probably more than Amazon has in total.
I was being so extremely sarcastic to show how ridiculous it is to say "harder to scale than Twitter". Of course there are nowhere near 175 million users, of course they don't tweet ten tweets per hour, and of course the tweets aren't uncompressed. My point is that even with all these assumptions, you have a ridiculous lack of scaling problems. My mention of amazon was meant to show that I was absolutely 100% being completely sarcastic when talking about Twitter's "scaling" problems.
I'm assuming you're being sarcastic -- but the "everyone tweets ten Tweets per hour" figure sounds like it's off by at least one (if not 2 or 3) orders of magnitude.
1) There are far fewer than 175 million active users of Twitter.
2) A tweet compresses to far less than 140 bytes. Text normally compresses to (easily) a third of its size.
3) Wikipedia says there are some 300 million tweets per day... or about 13 gigabytes per day (compressed).
A single 1 TB hard drive is almost certainly good enough to store a couple of months of tweets, and a year of tweets should fit on a couple of thousand dollars' worth of hardware (quick back-of-envelope below).
and 4) I added the Amazon line so that you guys would know I was being absolutely, 100% sarcastic in every way possible. I guess the clue still wasn't enough.
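For concreteness, the back-of-envelope behind points 2 and 3 (the per-tweet size is an assumption: roughly 140 bytes at ~3:1 text compression):

    object TweetStorageNapkin {
      val tweetsPerDay     = 300e6            // the Wikipedia figure cited above
      val bytesPerTweet    = 140.0 / 3        // ~47 bytes, assuming ~3:1 compression
      val bytesPerDay      = tweetsPerDay * bytesPerTweet   // ~14 GB/day compressed
      val daysPerTerabyte  = 1e12 / bytesPerDay             // ~71 days on a single 1 TB drive
      val terabytesPerYear = bytesPerDay * 365 / 1e12       // ~5 TB/year of raw tweet text
    }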
My point is that there are very few things that are as easy to scale as a platform built on sharing 0.13 kilobyte tweets (0.04 KB compressed), in an age where rendering many common front pages requires a browser to download 600-800KB in static and dynamic content. Come on.
Sometimes the Internet gets sarcasm.... and sometimes we fail badly.
600-800KB +++
I was chatting with my brother-in-law recently about his website (he's a fashion photographer). His homepage was about 5.6MB. When I asked him about it, he figured it was normal. Sure enough, he rattled off a few names for me to check, and all of them had front pages >6MB.
You're forgetting about the network effect involved in scaling something like Twitter, which is commonly overlooked by people judging services like that.
Tumblr is an extremely pageview-heavy design, but the industry has been moving away from pageviews as an important metric for a few years now. Fortunately they're nice enough to post their Quantcast data publicly: http://www.quantcast.com/p-19UtqE8ngoZbM
They're still phenomenal numbers, but IMHO it should be much closer to a 100-server environment than a 1000-server one.
It all depends on the nature of the request. Our nginx servers are tuned to burst to 7000 requests/second and run stable at 4800 requests/second for simple content. Our overall scale is about 1/5th of what Tumblr is doing daily, and we have a much smaller infrastructure footprint. Although, in fairness, our content is nowhere near as dynamic as theirs.
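As a rough sanity check on the 100-vs-1000-server question (every input here is an assumption for illustration; the pageview rate is just what the article's headline "15 billion page views a month" implies):

    object CapacityNapkin {
      val pageviewsPerDay = 500e6                        // ~15B/month spread over 30 days
      val avgViewsPerSec  = pageviewsPerDay / 86400      // ~5,800 pageviews/s on average
      val peakViewsPerSec = avgViewsPerSec * 3           // assume ~3x peak-to-average ratio

      // At the ~4,800 req/s quoted above for simple content, the front tier
      // alone would be a handful of boxes...
      val frontendBoxes = math.ceil(peakViewsPerSec / 4800)  // ~4

      // ...but each pageview fans out into many internal requests (dashboard
      // queries, cache lookups, inbox writes), so raw request arithmetic says
      // very little about the total server count.
      val internalReqPerView = 20.0                            // pure assumption
      val backendReqPerSec   = peakViewsPerSec * internalReqPerView  // ~350,000 req/s
    }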
I've done ~1B/day on around 6-10 1GB GoGrid instances, though admittedly the complexity was lower (an ad-serving platform), though it wasn't just a proxy/static server either. When I read this I actually messaged my partner: "Imagine what we could do with 1000 servers." I imagine a lot of those servers aren't directly related to serving the site, though. The number of support servers required is usually way underestimated.
I think you're on to something when you contrast the two problem domains. The number of requests is a very naive way to look at load factors.
At the startup where I work, we've got 25-30 million users, many stats similar to Tumblr's, and we're running it on about 250 EC2 instances of varying size. I think if Tumblr's numbers are high at all -- due to rapid iteration and no time to focus on deep optimizations -- they're maybe 10-15% high, not 90%.
I'm saying this because I've seen periods where our usage numbers are somewhat flat, even falling, but our hardware demands rise as we provide more features. When there are just one or two primary ways to use a service (e.g. "I post status updates and comment on my friends' status updates"), it can be quite easy to optimize. But add features. Photos. Chat. An in-house ad-serving platform. I18N. Etc. You have different types of interactions with different acceptable service levels and varying storage requirements.
Holy schnikeys! The graph changes are two orders of magnitude larger than the content additions.