And I'm arguing that's going to be a hell of a ride, one more likely to break their backs than yield success.
I read up on the link you posted elsewhere (thanks for that, by the way) and, well, it's just as bad as I thought.
Imagine, in the pre-2.0 world, there's a greedy reader, and a subsequent writer queued up on the lock. All subsequent readers are blocked until the writer finishes, and the writer can't finish before the greedy reader does. This is a nightmare. That was before 2.0; now the greedy reader will yield, the writer will finish, and all the pending readers will be unblocked. That's only an improvement if you compare it to the nightmarish previous situation. There are still two problems: 1) a single writer blocks all readers on a shard while it's in progress; 2) as soon as a new writer is queued up behind the reader lock, all subsequent readers are queued up again.
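To make problem #2 concrete, here's a minimal Python sketch of a writer-preferring readers/writer lock - not MongoDB's actual implementation, just the textbook queuing behavior being described - showing why a single queued writer stalls every reader that arrives after it:

    # Minimal sketch of a writer-preferring readers/writer lock.
    # Not MongoDB's code; it just illustrates the queuing behavior.
    import threading

    class WriterPreferringRWLock:
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0            # readers currently holding the lock
            self._writer_active = False
            self._writers_waiting = 0    # writers queued behind the readers

        def acquire_read(self):
            with self._cond:
                # New readers queue up as soon as ANY writer is waiting,
                # even though the current holder is just another reader.
                while self._writer_active or self._writers_waiting > 0:
                    self._cond.wait()
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def acquire_write(self):
            with self._cond:
                self._writers_waiting += 1
                # The writer can't proceed until the greedy reader is done.
                while self._writer_active or self._readers > 0:
                    self._cond.wait()
                self._writers_waiting -= 1
                self._writer_active = True

        def release_write(self):
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()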
Does this look like optimal resource use to you? It does not to me.
Let's contrast this with "legacy" engines:
Sybase: the reader/writer lock has a granularity of an 8 KB page. If you're not touching the same page someone else is writing, you're fine. (*) They might have moved on since the 1990s; I haven't looked.
Microsoft: the reader/writer lock has a granularity of a row. If you're not reading a row someone is writing, you're fine. That was in the 1990s; they have since moved on to snapshots, but have not yet made them the default option, I think.
PostgreSQL or Oracle: 1) readers read a snapshot and never block writers; 2) writers block each other, and the granularity of locking is a single row. If you're not writing the same row someone else is writing, you're fine (see the sketch after this list).
SQLite: readers do not block writers; there is a database-wide writer/writer lock. Note that this is a very lightweight, desktop-oriented database, not a cloud solution.
MongoDB: the reader/writer lock granularity is a shard, the part of the database apportioned to a single CPU core. If you happen to read data on the same shard someone is writing, or is planning to write, you're not fine at all. Their plan is "collection-level locking".
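To make the row-level case concrete, here's a hedged sketch with psycopg2 against a hypothetical accounts(id, balance) table (table and connection string are made up): two sessions write different rows without blocking each other, and a read of the locked row just sees the committed snapshot.

    # Hedged sketch: row-level locking plus snapshot reads in PostgreSQL.
    # Assumes a local database with a table accounts(id int primary key, balance int).
    import psycopg2

    conn_a = psycopg2.connect("dbname=test")   # session A
    conn_b = psycopg2.connect("dbname=test")   # session B
    cur_a = conn_a.cursor()
    cur_b = conn_b.cursor()

    # Session A updates row 1 and holds the transaction open (no commit yet).
    cur_a.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")

    # Session B updates a *different* row: no contention, it proceeds immediately.
    cur_b.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")

    # Session B reads the row A is writing: no blocking either, it just sees
    # the last committed version (the snapshot), not A's uncommitted change.
    cur_b.execute("SELECT balance FROM accounts WHERE id = 1")
    print(cur_b.fetchone())

    conn_b.commit()
    conn_a.commit()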
So I get it, you're saying they planned to add serious concurrency later. I agree on that - they planned. Where you and I disagree is whether they will succeed: I think they will likely fail, because retrofitting concurrency is exceptionally hard. I just can't believe that anyone who knows what he's getting into would actually agree to get into this.
I understand you need to compromise something when you start out, but I think concurrency is the worst possible choice.
In Microsoft SQL Server, row/extent/table locks have transactional semantics and are often turned off with the NOLOCK option. What really matters for concurrency is the page latch, which is per 8 KB page.
In SQLite, readers actually do block writers by default. Writing transactions are committed through lock escalation steps: first a shared lock is acquired, then reserved, then pending, and finally exclusive. Pending blocks new shared locks and waits until all in-flight shared locks are released. Again, this is the default rollback-journal behavior. As of 3.7, the write-ahead log allows readers to be concurrent with writers, but AFAIK it is still rarely used.
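A minimal sketch of the WAL case with Python's stdlib sqlite3 (the scratch file name is made up): while a write transaction is open, a second connection still reads the last committed snapshot without blocking.

    # Minimal sketch: WAL mode lets a reader run alongside an open writer.
    import sqlite3

    w = sqlite3.connect("demo.db")
    w.execute("PRAGMA journal_mode=WAL")
    w.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
    w.execute("INSERT INTO t (v) VALUES ('before')")
    w.commit()

    w.execute("BEGIN IMMEDIATE")           # take the single write lock
    w.execute("UPDATE t SET v = 'after'")  # not committed yet

    r = sqlite3.connect("demo.db")         # independent reader connection
    print(r.execute("SELECT v FROM t").fetchall())  # sees 'before', no blocking

    w.commit()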
I recently found out that turning on NOLOCK is a horrible idea. It doesn't just take you out of transactions. That is, you won't just get uncommitted data; apparently you can also get completely inconsistent data as internal structures are updated. Even rows that are not part of a current transaction might not be seen if you use NOLOCK.
My experience with NOLOCK is that you may get inconsistent rows, with some fields from before and some from after an update. Or you may even see duplicated rows when the b-tree is rebalanced. But I have never seen single fields being partially updated. Per Microsoft, the page latch protects the atomicity of a single-field update. This is why NOLOCK was extremely useful for insert-only tables, and in our database design we had many of them.
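For reference, the hint goes on each table reference in the query. A hedged sketch with pyodbc (connection string and table name are made up), reading an insert-only table under NOLOCK:

    # Hedged sketch: reading with the NOLOCK hint (read uncommitted).
    # No shared locks are taken, so concurrent inserts aren't blocked,
    # at the price of possibly seeing uncommitted or duplicated rows.
    import pyodbc

    conn = pyodbc.connect("DSN=sqlserver;UID=app;PWD=secret")  # made-up DSN
    cur = conn.cursor()
    cur.execute(
        "SELECT TOP 100 event_id, payload "
        "FROM dbo.Events WITH (NOLOCK) "
        "ORDER BY event_id DESC"
    )
    for row in cur.fetchall():
        print(row.event_id, row.payload)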
My position on mongo is thus: Its goal is humongous data sets, hence the name. Until well proven, I'm not the type to use it for huge data, but will keep an eye out for case studies.
I have used Mongo on two projects with reasonably small data sets. My largest collection at the moment is 5 million documents, and that's basically a log. Other collections are less than 100,000. I've been running mongo 1.6 for a year on these two sites without so much as a hiccup. I do the normal, very simple things to protect myself: a cron job to dump the db and then copy it to a backup server. And that's it.
I enjoy using mongo for these projects because when I want to add a new feature to one of my domain models, I don't need to think much about retrofitting the data for all instances of that model. I just add an attribute where it's needed for the new use case, ensure I have basic checking in my ruby model object, and my system keeps incrementally improving.
I think the mongo folks are fantastic in their open dev process and maybe one day, some threshold will be crossed where I can say that for certain types of big data usage mongo is a clear solid choice.
'I don't need to think much about retrofitting the data for all instances of that model. I just add an attribute where it's needed for the new use case, ensure I have basic checking in my ruby model object and my system keeps incrementally improving.'
That's exactly the same as adding a new column to your DB with NULL as the default value.
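For comparison, here's that column version sketched with Python's stdlib sqlite3 (table and column names are made up; DROP COLUMN needs SQLite 3.35+):

    # Sketch: adding a column with a default, and dropping it to revert.
    import sqlite3

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")

    # Add the column with a sensible default; existing rows pick it up.
    db.execute("ALTER TABLE users ADD COLUMN newsletter_opt_in INTEGER DEFAULT 0")

    # Reverting is just dropping the column again (SQLite 3.35+).
    db.execute("ALTER TABLE users DROP COLUMN newsletter_opt_in")
    db.commit()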
Mongo has the notion of undefined and null. You can just start putting the new field on new records without having to backfill. Also, you don't have to do the migration thing, which can get messy in big teams (in my experience).
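A hedged pymongo sketch of what that looks like (database, collection and field names are made up): new documents get the field, old documents are simply left without it, and nothing is backfilled.

    # Hedged sketch: start writing a new field only on new documents.
    from pymongo import MongoClient

    users = MongoClient()["app"]["users"]   # made-up db/collection names

    # New records carry the field; old records are untouched.
    users.insert_one({"name": "alice", "newsletter_opt_in": True})

    # No migration, no backfill: old documents simply don't have the field.
    print(users.count_documents({"newsletter_opt_in": {"$exists": False}}))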
Moving to a doc store from an RDBMS really does bring with it an odd sense of freedom when it comes to the schema.
You don't need to do any 'migration thing', you just add the column to the DB and choose a sensible default value? I don't see what you gain by having both 'undefined' and 'null'.
The 'odd sense of freedom' is not always a good thing either. It's like BASIC allowing you to use a new variable without declaring it. It may be convenient but nobody calls it a good idea.
"You don't need to do any 'migration thing', you just add the column to the DB and choose a sensible default value?"
Taking the team I worked with at the BBC as an example:
1) There were staging, integration and production environments. Staging and integration would often not be aligned with production, or even with one another, because we might find that a bit of code turned out not to be production-suitable/needed. If this happened we would have to drop the database back to a known, good state. You can't have columns with constraints left around when the code which might have satisfied those constraints is reverted. Doing it without migrations would have been idiotic, to say the least.
2) Developers work on different features in different branches, often collaborating. Different features apply new attributes to the db schema. It's important for a developer to know his DB is in the correct state when he starts hacking. You do that with migrations (a minimal sketch of the idea follows this list).
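To be concrete about what I mean by migrations, here's a minimal framework-free sketch of the idea (this is not Rails/ActiveRecord, and the example migration is made up): every schema change is a numbered up/down pair, and a version table records how far a given database has been rolled forward.

    # Minimal sketch of schema migrations: numbered up/down steps plus a
    # version table, so any environment can be rolled to a known state.
    import sqlite3

    MIGRATIONS = [
        # (version, up SQL, down SQL) -- the change itself is just an example
        (1, "ALTER TABLE users ADD COLUMN newsletter_opt_in INTEGER DEFAULT 0",
            "ALTER TABLE users DROP COLUMN newsletter_opt_in"),
    ]

    def current_version(db):
        db.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
        return db.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0

    def migrate(db, target):
        version = current_version(db)
        if target >= version:
            for v, up, _ in MIGRATIONS:              # roll forward
                if version < v <= target:
                    db.execute(up)
                    db.execute("INSERT INTO schema_version VALUES (?)", (v,))
        else:
            for v, _, down in reversed(MIGRATIONS):  # roll back to a known good state
                if target < v <= version:
                    db.execute(down)
                    db.execute("DELETE FROM schema_version WHERE version = ?", (v,))
        db.commit()

Every developer and every environment runs the same migrate() to the agreed version and ends up with the same schema.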
Because you almost completely remove the need for schema definition (and what little of it you do need, you can do in app code), you simply don't need the migrations any more. Using mongo means you can pretty much just export your application's domain without having to coerce it into the relational model.
"I don't see what you gain by having both 'undefined' and 'null'."
They mean totally different things. Undefined means that the field has never been explicitly set; null means the field has been set. This means you know what's been backfilled and what hasn't - you can't tell without extra metadata in MySQL. Also, in MySQL, if you provide a null default then every row has to be updated.
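A hedged pymongo sketch of the distinction (collection and field names are made up): querying on null matches both cases, while $exists and $type tell them apart.

    # Hedged sketch: missing field vs. explicit null in MongoDB queries.
    from pymongo import MongoClient

    users = MongoClient()["app"]["users"]   # made-up db/collection names

    users.insert_one({"name": "old_user"})                       # field never set
    users.insert_one({"name": "new_user", "last_login": None})   # explicitly null

    missing = users.count_documents({"last_login": {"$exists": False}})   # old_user only
    explicit_null = users.count_documents({"last_login": {"$type": 10}})  # 10 = BSON null; new_user only
    either = users.count_documents({"last_login": None})                  # matches both

    print(missing, explicit_null, either)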
"It's like BASIC allowing you to use a new variable without declaring it. It may be convenient but nobody calls it a good idea."
I don't know BASIC but you can put Perl into a certain configuration that allows this. That makes for horrible scoping issues that aren't analogous or applicable to what we're talking about.
If you added a column and you want to revert back, you just drop the column again! What's so hard about that? No 'migration' needed.
In BASIC you can 'declare' a variable by simply using it. The compiler will not warn you if you use an undeclared variable. That's the analogous situation here.
"""My position on mongo is thus: Its goal is humongous data sets, hence the name. Until well proven, I'm not the type to use it for huge data, but will keep an eye out for case studies."""
Actually Mongo is bad for really humongous data sets.
It works well if the working data set (the data you commonly need) can fit in memory.
Of course this doesn't scale very well with, say, several terabytes of data, while there are Oracle databases that handle a lot more...
In the case of Mongo, you go to sharding etc. and things get complicated in your app's handling.