> Should parts of the db be cached in ram and just used for writes etc. You ...

axod · on Jan 7, 2009

Often a database is used to do simple message passing - campfire, twitter, pretty sure they both use a db to pass messages around.

>> "You should not use flat files for web apps - they don't handle concurrency very well."

That's a silly blanket statement. If I have a single thread that deals with something, of course it can use a flat file to store it. The problem is, that some people decide to use webservers that cause concurrency issues by having multiple threads doing similar things for different users.

It sounds like you're still thinking about standard accepted many threads, database, etc etc.

ars · on Jan 7, 2009

Uh what?

If you have just a single thread of course you don't have to worry about concurrency. Isn't that what I just said?

And you are seriously making a webserver that serves just one request at a time?

There is a good reason it's "standard accepted" to use many threads, and a database. I guess you could have just one thread, with a queue, and do just one thing at a time without a database. Don't know why you would want that though.

axod · on Jan 7, 2009

My webserver is currently serving 5k clients in a single thread...

If you follow "standard accepted", you won't get anywhere.

ars · on Jan 7, 2009

Um, everyone's webserver can handle 5k requests (per day) in a single thread. Not sure how many requests your clients do per day since you didn't say.

Can you handle 1 million requests (per day) on a single thread?

By using just one thread you are serializing your bottlenecks (CPU, IO, network).

If you have more than one, one can be waiting for IO, while the other uses CPU.

Plus CPU speeds are not getting faster, the future is multi core.

abstractbill · on Jan 7, 2009

Um, everyone's webserver can handle 5k requests (per day)

He's not talking about the number of requests per day, he's talking about the number of simultaneous requests.

ars · on Jan 7, 2009

Maybe I'm not understanding something - how do you do simultaneous requests in a single thread?

And I knew he couldn't possibly mean 5K per day, but he didn't say how many it was.

abstractbill · on Jan 7, 2009

Maybe I'm not understanding something - how do you do simultaneous requests in a single thread?

http://www.kegel.com/c10k.html

I'm not too shocked you don't know about it, to be honest - not enough people do for some reason. Axod and I are lucky to have worked together on a very large-scale problem at a previous company that could never have been handled with a threaded approach.

I've since moved on to Justin.TV, where I wrote a single-threaded chat server that scales to over 10k connections per server cpu (we run it on many 8-cpu boxes). Axod is now the founder of mibbit, and he's obviously using a single-threaded approach there too.

ars · on Jan 7, 2009

I'm starting to understand what you mean.

You have one program, that handles multiple requests in the same program - but it's just one program.

As opposed to multiple programs, each handling one request.

I can see how that will handle any IO issues, and if starting a program has overhead, that will help too, but it still seems like it won't do a good job of keeping the CPU busy.

But you did say earlier that you were not CPU bound. All my websites have been CPU bound (well I think they are CPU bound), so I guess that's why I didn't get it at first.

axod · on Jan 7, 2009

They may well be CPU bound because of context switching.

abstractbill · on Jan 7, 2009

Right. Ars, here are a couple of points you may be missing:

- Adding threads "works" up to some small number (maybe a few hundred or so - depends on your platform). Then adding more threads just takes up more cpu without doing any useful work. Your program can seem cpu-bound, when actually you just have so many threads that none of them can do anything.

- The approach axod and I are talking about uses a single thread to service many network connections. Obviously you have to write your code quite differently to handle this: Your code is typically "triggered" by an event like receiving a few bytes over one network connection. You do a (very!) small amount of work, and then you quickly return control to the network core (called the "reactor" in Python's Twisted library). The core then waits for another event (e.g. more bytes arrive on a different network connection), and the cycle repeats.

Hope that helps.

ars · on Jan 7, 2009

It does, thanks.

I was letting apache do the threads, so under 100 probably.

Thanks for posting this - and staying on the thread. I should probably go back and re-read the thread now that I get what you are saying.

ntoshev · on Jan 8, 2009

I'd be curious to know what you think about this alternative architecture:

http://mailinator.blogspot.com/2008/08/benchmarking-talkinat...

Paul Tyma claims handling 40000 chat messages per second on a quad core desktop system with it.

axod · on Jan 8, 2009

I tried a similar sort of architecture at one point, the issue is that if you share blocking calls in a thread, at some point something will block that you don't expect to.

It's also just far far simpler to go with a single networking thread. Then pass off any cpu intensive, or long running tasks, or blocking tasks, to other threads.

axod · on Jan 7, 2009

The internet has made me sad. :(

http://en.wikipedia.org/wiki/Asynchronous_I/O

abstractbill · on Jan 7, 2009

Be happy, there was a time when we didn't know about this stuff either ;-)

axod · on Jan 7, 2009

Ah the days of threadpools, the cpu battling to context switch. The late nights battling to hold the cpu steady and add just another few threads, good times, good times :)

You are of course right though. Easy to forget...

axod · on Jan 7, 2009

I don't think this discussion is going anywhere, but for reference, my webserver does around 45 million requests a day, and yes, that's in a single thread. Webservers aren't typically cpu bound.

ars · on Jan 7, 2009

Thanks for the numbers, 45 million and 5k are kinda different.

I'm not trying to be argumentative, I have never done a very large scale website, and only now did I check your profile to see that you did.

But I still don't understand how come it's better to do a single thread. Also, all my websites have been CPU bound.

And even if it's IO bound, if you have more than one disk, adding a thread can only help, no?

axod · on Jan 7, 2009

Adding threads only helps if you have more cores, or if you're forced to use something that may block.

But even then, it's better to have a set number of threads doing different tasks, rather than one per user. eg have a network thread, a db thread(s) etc.

zmimon · on Jan 7, 2009

First: this is a really interesting thread and I have a lot of respect for your experience.

But is it really better to statically allocate resources to threads? You may have 8 cores on a box and 1 of them burning and 7 of them cruising. By utilizing a small thread pool and letting the scheduler spin things off dynamically you can turn that into 8 cruising instead.

Just curious.

axod · on Jan 8, 2009

The networking thread shouldn't be using hardly any CPU. If it is, then somethings badly wrong. It's better to have a single networking thread, trawling through the connections, moving data around, and have it talking to other threads that handle long jobs, cpu intensive tasks, or anything that might block.

abstractbill · on Jan 7, 2009

You are working way too hard. Let the OS figure out what needs to be in ram - it does it automatically anyway, and it does a better job that you an since it's caches what actually used, and not what you think should be used.

So why does memcached exist?

blasdel · on Jan 8, 2009

Because the kernel's cache is for things accessed via the VFS layer. It is extremely good at caching file accesses.

The kernel can cache the FS calls that the DB makes, but it can't cache calls to the DB!

ars · on Jan 7, 2009

> So why does memcached exist?

To cache the results of complicated joins (or queries without indexes).

It's pretty much pointless if all you are doing is caching the result of a simple query using an index.

abstractbill · on Jan 7, 2009

It's pretty much pointless if all you are doing is caching the result of a simple query using an index.

Not if your database is under heavy load, and you can easily shift some of that load by putting frequently accessed things in memcached instead.

ars · on Jan 7, 2009

That's if you have two machines. The comparison was vs keeping stuff in a hash table in memory, and I was saying databases are no worse.

abstractbill · on Jan 8, 2009

The comparison was vs keeping stuff in a hash table in memory, and I was saying databases are no worse.

But that's clearly not true. In the most extreme case, that hash table is referenced simply by a variable in your program - it's already in your program's address-space! There's no way a database can come close to that.

gnaritas · on Jan 8, 2009

Not to mention that you can hash arbitrary objects in a hash table with no mapping of any kind.

    hash at: key put: anObject

Databases are vastly more complicated and require me to completely disassembly the object graph anObject may contain into a set of tables and rows to store it, and then reassemble the graph from tables and rows back into their object form when fetching.

The second one commits to using a relational database, one often easily triples the size of the code base. There's nothing simple about that.

gnaritas · on Jan 7, 2009

> To cache the results of complicated joins (or queries without indexes).

Not true at all. The purpose of Memcached is to completely avoid a call to the database because the database, even if it keeps everything in memory can't touch the read speed of a distributed hash table. Memcached allows you to spread the reads across farms of boxes instead of sending them all to what is usually a single database server.