> In the end it is merely a question of data normalization
Can you elaborate?
I understand any JSON can be shredded into a third normal form (minus the ordering problem, but let's leave that aside for now), is this what you refer to?
Yes, this is exactly what I mean. Let's start from [1].
The kosher way of designing a document-oriented database is to simply embed everything within a single document. This is great for operations such as full-text search, retrieving whole objects, etc. We call this the denormalized form.
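To make this concrete, here is a minimal sketch in Python of what a fully embedded document might look like (the blog-post shape and all field names are made up for illustration):

```python
# A hypothetical fully embedded ("denormalized") blog post: author,
# tags and comments all live inside one document, so a single fetch
# returns everything needed to render the page.
post = {
    "_id": 1,
    "title": "Schema design in document stores",
    "author": {"name": "alice", "email": "alice@example.com"},
    "tags": ["databases", "nosql"],
    "comments": [
        {"author": {"name": "bob"}, "text": "Nice writeup."},
        # Note how the author data is simply duplicated inline:
        {"author": {"name": "alice", "email": "alice@example.com"},
         "text": "Thanks!"},
    ],
}
```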
On the other hand, sometimes we want to analyze our data, e.g. extract subsets of it and view them in isolation. With a completely denormalized form this is expensive, since we either need to manually touch each and every document and extract the data, or we need to maintain indices that help us out. Both are extremely resource-intensive.
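Continuing the made-up example above, a sketch of why this hurts: answering even a simple analytical question means walking every document in the collection.

```python
# Without an index, "who has ever commented?" degenerates into a full
# collection scan that unpacks each document's nested structure.
def distinct_commenters(all_posts):
    names = set()
    for p in all_posts:                  # touch every document
        for comment in p["comments"]:    # dig into the nesting
            names.add(comment["author"]["name"])
    return names

# Using the `post` document from the sketch above:
print(distinct_commenters([post]))  # -> {'alice', 'bob'} (set order may vary)
```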
The third option is that we chop our objects into smaller objects and then link them together. But this means that retrieving a whole document will take longer (multiple database requests); it also adds the overhead of eliminating duplicates (two objects may appear identical, but really are not), etc.
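Again with the same made-up example, this is roughly what the shredded (normalized, approximately 3NF) form looks like, and why reassembling the whole document now costs multiple lookups:

```python
# The same data chopped into smaller, linked objects: each entity is
# stored once and referenced by id instead of being embedded.
authors = {
    10: {"name": "alice", "email": "alice@example.com"},
    11: {"name": "bob"},
}
posts = {
    1: {"title": "Schema design in document stores", "author_id": 10},
}
comments = [
    {"post_id": 1, "author_id": 11, "text": "Nice writeup."},
    {"post_id": 1, "author_id": 10, "text": "Thanks!"},
]

# Rebuilding the full post takes several lookups (in a real database,
# several requests), which is the retrieval price of normalization.
def load_post(post_id):
    p = dict(posts[post_id])
    p["author"] = authors[p["author_id"]]
    p["comments"] = [
        {"author": authors[c["author_id"]], "text": c["text"]}
        for c in comments
        if c["post_id"] == post_id
    ]
    return p
```

On the flip side, the ad-hoc question from before becomes a trivial scan over the small `comments` list instead of a walk over every full document.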
Denormalization gives you horizontal scalability, but takes away ad-hoc querying. It also wastes storage (document size is a minor issue; the indices, however, will kill you [2]).
Normalized data will take away horizontal scalability, give you ad-hoc querying, and save storage space.
In the end, for any kind of nontrivial system you will eventually reach a point where you need to maintain two storages: a normalized and a denormalized form. The only difference is what your primary problem is; that determines whether you start out from normalized or denormalized storage, which will be your primary storage and source. The other kind will be an offline slave that offers secondary functionality.
E.g.: if you start out from relational data and want to build FTS, you WILL have to denormalize your data. On the other hand, if you start from an object/document store and want to offer ad-hoc analytics, you WILL have to normalize your data. It's good to keep this in the back of your head.
[1]: http://www.mongodb.org/display/DOCS/Schema+Design
[2]: An application I worked on had 2 GB (100 million documents) worth of data; however, the completely indexed database would take 25 GB of storage, and an index rebuild would take ~8 hours.