This was very surprising to me as well. Assuming IDs are 8 bytes (they appear to be) and that the mean post is 250 bytes (which seems conservative to me), the mean fanout works out to something like 1600. That's big.
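(For anyone unfamiliar with the pattern under discussion, here is a minimal sketch of fanout-on-write with an inbox of post IDs; the names and numbers are illustrative, not Tumblr's actual schema.)

    import scala.collection.mutable

    // Fanout-on-write sketch: each follower's inbox stores only the 8-byte
    // post ID; the ~250-byte post body is stored once in a separate map.
    object FanoutSketch {
      val posts   = mutable.Map.empty[Long, String]         // postId -> body (stored once)
      val follows = mutable.Map.empty[String, Seq[String]]  // author -> followers
      val inboxes = mutable.Map.empty[String, List[Long]]   // user -> post IDs, newest first

      def publish(author: String, postId: Long, body: String): Unit = {
        posts(postId) = body
        // One 8-byte ID per follower: at a mean fanout of ~1600 that is
        // ~12.8 KB of IDs per post, versus ~400 KB if full posts were copied.
        for (f <- follows.getOrElse(author, Seq.empty))
          inboxes(f) = postId :: inboxes.getOrElse(f, Nil)
      }

      def dashboard(user: String, n: Int): Seq[String] =
        inboxes.getOrElse(user, Nil).take(n).flatMap(id => posts.get(id).toList)
    }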
It seems that the only way to scale a site like this is to build decoupled services, maybe using a message queue (AMQP, etc.).
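A minimal sketch of that kind of decoupling, using the RabbitMQ Java client from Scala (the queue name and message format are made up for illustration): the web tier publishes an event and returns immediately, while a separate worker consumes the queue at its own pace.

    import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

    object QueueSketch {
      def main(args: Array[String]): Unit = {
        val factory = new ConnectionFactory()
        factory.setHost("localhost")
        val channel = factory.newConnection().createChannel()

        // Durable queue shared by the web tier (producer) and background workers.
        channel.queueDeclare("post-created", true, false, false, null)

        // Web tier: publish the event and move on; the slow work happens elsewhere.
        channel.basicPublish("", "post-created", null, "postId=12345".getBytes("UTF-8"))

        // Worker: consume events and do the expensive work asynchronously.
        channel.basicConsume("post-created", true, new DefaultConsumer(channel) {
          override def handleDelivery(tag: String, env: Envelope,
                                      props: AMQP.BasicProperties, body: Array[Byte]): Unit =
            println("processing " + new String(body, "UTF-8"))
        })
      }
    }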
How can I learn more about this kind of system? Also, are there any open-source applications built in a distributed way that I could learn from?
Exactly. If you don't know what Thrift is, check it out.
In addition to solving the problem with simple, focused applications (in the classical Unix style), it solves other practical issues, like PHP having shitty clients for most modern data stores (Cassandra, Redis, hell, even Memcached) compared to what you get in, say, Python or a JVM language.
I don't really know of any. Nobody writes this sort of distributed application from scratch. They're almost always evolved from monolithic database-backed applications, and the process of evolution is half development, half ops. I doubt that Tumblr as it exists today could be deployed from scratch, even if you had all of the source code of all of the systems.
Gearman, briefly mentioned in the article, can probably help you as well. It lets you do cross-language, cross-platform (a)synchronous function calls. Looks cool; never used it.
I dunno, it seems they might do well to hire based on someone's ability to survive a technological gauntlet. But the word "useless" makes the statement kind of... useless.
It looks like they've got at least three different programming ecosystems thrown together. Admittedly some of it may be legacy, but still, most large environments I've had experience with have taken a somewhat more focused approach. The more moving parts you have, the more things you have to head-scratch at when something breaks and you get a 3am page. ;)
Yeah, I could see a personal project using a good chunk of those (Apache, nginx, Memcached, MySQL, Varnish, PHP). It's definitely not outrageous to think Tumblr uses them all.
This seems like a useless statement. Surely no one is consciously failing people they know fit the team and can perform the job. The whole point of interviews is noisily assessing these things. Point me at a reliable, cost-efficient way of determining that and I'll gladly champion it.
Perhaps their point is that HR and recruitment agencies do just filter candidates by "10 years of Redis experience", and most good candidates never get noticed.
I don't have C# on my resume. I wouldn't shrink from working on a C# codebase, though. But would I get an interview? Unlikely.
Agreed, and to me it sounds incredibly short-sighted. People who have survived technological gauntlets know how to adapt and help instill a sense of continuity in fellow team members.
I, for the life of me, don't understand why people defend these gauntlets. Wasn't getting a degree from MIT enough of a gauntlet? Wasn't writing reams and reams of working, robust, well-documented code proof enough of one's software engineering skills?
That's what college is advertised to do: show that you can put up with tons of work and stick it out to see the end you wish to achieve even if the means are not what you had hoped.
Try interviewing: every 2 or 3 years the interviewing crowd is influenced by that day's generational thinking (not a bad thing, just something to be aware of). I, for one, always welcomed people who had passed through amazing experiences; from recent interviewing, I was/am apparently in the minority. Oh, and I meant the gauntlet of experience, learning under fire; I'm not a college graduate.
I got the impression from the article that they were referring to passing through a technological gauntlet during the interview process (i.e. complicated programming puzzles) rather than life experience. Is that also what you are referring to?
> Initially an Actor model was used with Finagle, but that was dropped.
The Akka folks do a good job of keeping the actor model on your mind when building Scala systems, but it's obviously not always the right approach. I'd love to hear more about how they ended up abandoning the actor model.
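For context, Finagle composes services with futures rather than actors. A rough sketch of that style (plain scala.concurrent.Future here; Finagle's com.twitter.util.Future composes the same way, and the service calls are made up):

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // Futures instead of actor message passing: no mailboxes or supervision
    // trees, just asynchronous values that are mapped and joined.
    object DashboardSketch {
      // Hypothetical downstream calls.
      def fetchInbox(userId: Long): Future[Seq[Long]] = Future(Seq(1L, 2L, 3L))
      def fetchPost(postId: Long): Future[String]     = Future(s"post $postId")

      def dashboard(userId: Long): Future[Seq[String]] =
        fetchInbox(userId).flatMap { ids =>
          Future.sequence(ids.map(fetchPost))  // fan out, then join the results
        }
    }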
This may be a slightly naive question, but why list both "500 web servers" and "8 nginx"? Seeing as nginx IS a webserver, is it fulfilling a different function (serving static pages?) compared with the Apache servers?
Well, their assets are being served from a CDN. Are they counting that in their server count?
$ host assets.tumblr.com
assets.tumblr.com is an alias for assets.tumblr.com.edgesuite.net.
assets.tumblr.com.edgesuite.net is an alias for a1092.g.akamai.net.
a1092.g.akamai.net has address 69.31.106.32
a1092.g.akamai.net has address 69.31.106.50
After a fairly bad review of Tumblr's availability [1], this was something they needed to write. They had (and still have) pretty big challenges. It's not easy to scale something like that, and this article confirms it.
I'm stunned by the traffic figures, as I don't know one person who uses it (and I suspect the majority of my non-tech, Facebook-loving friends have never heard of it).
Is that anything to do with my demographic (41, M, UK based)?
Yes (demographic). Although all sorts of people use tumblr, it seems to be most popular among (predominantly American) high-school age (and a bit older) teenagers. It has settled into a middle ground between Twitter and traditional blogs for a lot of these users -- single-paragraph posts or photo posts are the norm, and re-blogging is very common/encouraged.
Tumblr is not a blogging platform like WordPress; it's much more a community, like Reddit. Yes, you can use Tumblr to host your blog, but the majority of people use the dashboard to interact with others and use their blog to share their content. Very few people will ever see username.tumblr.com; they'll see the posts via tumblr.com/dashboard. Like Twitter: very few people visit twitter.com/citricsquid, they follow me and see my tweets in their stream.
Dashboard is more like a "reader." Everybody you follow is found there. Also any tags you subscribe to are there. Interactions like replies are also found there. I'd compare it to your Facebook feed or Twitter.com stream. It's your base of operations.
Lots of read-only content, and then a decent amount of interactions and events (messages/new follows/like notifications).
The dashboard is where you primarily view all of the posts from the people you follow on Tumblr. Think of it in terms of what you view on Twitter, all of the Tweets from people you follow plus the ability to send out a new Tweet.
You do not write an article about technology by vomiting hundreds of bullet points mentioning every technology a company ever used.
It's an appallingly bad example of technical journalism.
This is not informational; it's just bullet-point-driven bikeshedding.
I understand that, hungryblank. But I don't consider myself a journalist. I'm a developer writing for developers, so they can hopefully learn something that will help them build cool stuff. To that end I use a very concise, easily digestible style with as little extraneous verbiage as possible. Nearly every point in the article could be exploded into a long-form article, and I already get in trouble for TL;DR :-)
I use Google Reader for two main reasons. One is that once I subscribe to a feed, I get access to all messages, not just new ones. The second is that I can search old messages as well.
So if I understand Tumblr's inbox model correctly, that's exactly the kind of usage pattern that isn't supported.
The UI is very similar to a Twitter dashboard (if you are familiar with that). I don't think it's possible to search for old posts in either case, and the style of the site doesn't really lend itself to that kind of use (for better or worse).
Aside from all this, can anybody imagine what their Amazon bill is like? Lucky for them, they have not-so-demanding venture capitalists to pay their bills. I honestly wonder how long they will be able to keep growing like this without making any money.
It's not that bad. 70% of views are from the Dashboard, where they don't see Disqus, and I suspect that "most blogs" is an artifact of the circles you run in. Anecdotally, none of the blogs I follow have Disqus.
Really? I didn't see that. Is this a political or a technical comment? When building highly scalable, stable systems, there isn't much except the JVM. Sure, for some corner cases Erlang might be a solution, and if reliability is less important, .NET as well.
What questions do they need to answer that require the entire graph? It seems like the most complex thing they may need to calculate is friends-of-friends, and that's easy to do even with an adjacency list in a SQL database.
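For what it's worth, here's a minimal sketch of that friends-of-friends calculation over an in-memory adjacency list (the same two-hop expansion is a single self-join if the edges live in a SQL table); the data is made up.

    object FriendsOfFriends {
      // Adjacency list: user -> users they follow.
      val follows: Map[String, Set[String]] = Map(
        "alice" -> Set("bob", "carol"),
        "bob"   -> Set("dave"),
        "carol" -> Set("dave", "erin")
      )

      // Two-hop expansion: everyone followed by someone the user follows,
      // minus the user and the people they already follow directly.
      def friendsOfFriends(user: String): Set[String] = {
        val direct = follows.getOrElse(user, Set.empty[String])
        direct.flatMap(f => follows.getOrElse(f, Set.empty[String])) -- direct - user
      }

      // friendsOfFriends("alice") == Set("dave", "erin")
    }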
It seems that in New York it was easier to hire people who had worked at scale with the JVM, since that is what the nearby banking institutions might standardise on.
Scala looks a lot more like Ruby than Java does, which makes people feel more comfortable with the language. Also, a lot of the patterns, like collections, are more similar between Scala and Ruby than between Java and Ruby. This will change with JDK 8, but for now it is true.
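As a small illustration: a Scala collection pipeline reads a lot like Ruby's select/map on an Enumerable, whereas pre-JDK 8 Java needs an explicit loop.

    object CollectionsFeel {
      val names = List("alice", "bob", "carol")

      // Scala: reads much like Ruby's names.select { ... }.map { ... }
      val shouting = names.filter(_.length > 3).map(_.toUpperCase)  // List("ALICE", "CAROL")

      // Pre-JDK 8 Java equivalent, for comparison:
      //   List<String> shouting = new ArrayList<String>();
      //   for (String n : names) {
      //     if (n.length() > 3) shouting.add(n.toUpperCase());
      //   }
    }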
I'm sorry, but I have a hard time believing that anything is harder to scale than Twitter. With an audience of 175 million users, a 140-byte tweet field makes for about 22 gigabytes of uncompressed text if every user tweets once. If everyone tweets ten times per hour, that's 220 gigabytes an hour. A one-TB HD would only be good for 5 hours or so, meaning Twitter would need to buy a good 3000 hard drives per year. That's probably more than Amazon has in total.
I was being so extremely sarcastic to show how ridiculous it is to say "harder to scale than Twitter". Of course there are nowhere near 175 million users, of course they don't tweet ten tweets per hour, and of course the tweets aren't uncompressed. My point is that even with all these assumptions, you have a ridiculous lack of scaling problems. My mention of amazon was meant to show that I was absolutely 100% being completely sarcastic when talking about Twitter's "scaling" problems.
I'm assuming you're being sarcastic -- but the "everyone tweets ten Tweets per hour" figure sounds like it's off by at least one (if not 2 or 3) orders of magnitude.
1) There are far fewer than 175 million active users of Twitter.
2) A tweet compresses to far less than 140 bytes. Text normally compresses to (easily) a third of its size.
3) Wikipedia says there are some 300 million tweets per day... or about 13 gigabytes per day (compressed).
A single 1 TB hard drive is almost certainly good enough to store a couple of months of tweets, and a year of tweets should fit on a couple of thousand dollars' worth of hardware (quick back-of-envelope below).
and 4) I added the Amazon line so that you guys would know I was being absolutely, 100% sarcastic in every way possible. I guess the clue still wasn't enough.
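For concreteness, the back-of-envelope behind points 2 and 3 (the per-tweet size is an assumption: roughly 140 bytes at ~3:1 text compression):

    object TweetStorageNapkin {
      val tweetsPerDay     = 300e6            // the Wikipedia figure cited above
      val bytesPerTweet    = 140.0 / 3        // ~47 bytes, assuming ~3:1 compression
      val bytesPerDay      = tweetsPerDay * bytesPerTweet   // ~14 GB/day compressed
      val daysPerTerabyte  = 1e12 / bytesPerDay             // ~71 days on a single 1 TB drive
      val terabytesPerYear = bytesPerDay * 365 / 1e12       // ~5 TB/year of raw tweet text
    }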
My point is that there are very few things that are as easy to scale as a platform built on sharing 0.13 kilobyte tweets (0.04 KB compressed), in an age where rendering many common front pages requires a browser to download 600-800KB in static and dynamic content. Come on.
Sometimes the Internet gets sarcasm.... and sometimes we fail badly.
600-800KB +++
I was chatting with my brother-in-law recently about his website (he's a fashion photographer). His homepage was about 5.6MB. When I asked him about it, he figured it was normal. Sure enough, he rattled off a few names for me to check, and all of them had front pages >6MB.
You're forgetting about the network effect involved in scaling something like Twitter, which is commonly overlooked by people judging services like that.
Tumblr is an extremely pageview-heavy design, but the industry has been moving away from pageviews as an important metric for a few years now. Fortunately they're nice enough to post their Quantcast data publicly: http://www.quantcast.com/p-19UtqE8ngoZbM
They're still phenomenal numbers, but IMHO it should be much closer to a 100-server environment than a 1000-server one.
It all depends on the nature of the request. Our nginx servers are tuned to burst to 7000 requests/second and run stable at 4800 requests/second for simple content. Our overall scale is about 1/5th of what Tumblr is doing daily, and we have a much smaller infrastructure footprint. Although, in fairness, our content is nowhere near as dynamic as theirs.
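As a rough sanity check on the 100-vs-1000-server question (every input here is an assumption for illustration; the pageview rate is just what the article's headline "15 billion page views a month" implies):

    object CapacityNapkin {
      val pageviewsPerDay = 500e6                        // ~15B/month spread over 30 days
      val avgViewsPerSec  = pageviewsPerDay / 86400      // ~5,800 pageviews/s on average
      val peakViewsPerSec = avgViewsPerSec * 3           // assume ~3x peak-to-average ratio

      // At the ~4,800 req/s quoted above for simple content, the front tier
      // alone would be a handful of boxes...
      val frontendBoxes = math.ceil(peakViewsPerSec / 4800)  // ~4

      // ...but each pageview fans out into many internal requests (dashboard
      // queries, cache lookups, inbox writes), so raw request arithmetic says
      // very little about the total server count.
      val internalReqPerView = 20.0                            // pure assumption
      val backendReqPerSec   = peakViewsPerSec * internalReqPerView  // ~350,000 req/s
    }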
I've done ~1B/day on around 6-10 1GB GoGrid instances, though admittedly the complexity was lower (an ad-serving platform), though it wasn't just a proxy/static server either. When I read this I actually messaged my partner: "Imagine what we could do with 1000 servers." I imagine a lot of those servers aren't directly related to serving the site, though. The number of support servers required is usually way underestimated.
I think you're on to something when you contrast the two problem domains. The number of requests is a very naive way to look at load factors.
At the startup where I work, we've got 25-30 million users, many stats similar to Tumblr's, and we're running it on about 250 EC2 instances of varying size. I think if Tumblr's numbers are high at all -- due to rapid iteration and no time to focus on deep optimizations -- they're maybe 10-15% high, not 90%.
I'm saying this because I've seen periods where our usage numbers are somewhat flat, even falling, but our hardware demands rise as we provide more features. When there are just one or two primary ways to use a service (e.g. "I post status updates and comment on my friends' status updates"), it can be quite easy to optimize. But add features. Photos. Chat. An in-house ad-serving platform. I18N. Etc. You have different types of interactions with different acceptable service levels and varying storage requirements.
Holy schnikeys! The graph changes are two orders of magnitude larger than the content additions.