It's different from a typical sharding approach (including what MongoDB does). In their model, you take a single key and distribute your data using that key (e.g. user_id). The problem surfaces when you look at secondary indexes.
If you have a secondary index, say on user_location, and you want to query by that index, you don't know which shard to go to. So you end up broadcasting.
Another problem is enforcing unique index constraints.
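To make the broadcast problem concrete, here is a toy sketch of single-key sharding in Python. It is not how MongoDB or any particular system is implemented; the shard count, the hash function, and all function names are made up for illustration.

```python
import hashlib

SHARDS = 4  # hypothetical number of shards

def shard_for(user_id):
    """Toy placement function: hash the shard key to a shard number."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS

# Each shard holds user_id -> user_location for the rows it owns.
shards = [dict() for _ in range(SHARDS)]

def insert(user_id, user_location):
    shards[shard_for(user_id)][user_id] = user_location

def find_by_user_id(user_id):
    # The shard key tells us exactly which shard to ask.
    return shards[shard_for(user_id)].get(user_id)

def find_by_location(user_location):
    # user_location says nothing about placement, so every shard is asked.
    hits, shards_contacted = [], 0
    for shard in shards:
        shards_contacted += 1
        hits.extend(uid for uid, loc in shard.items() if loc == user_location)
    return sorted(hits), shards_contacted
```

A lookup by user_id touches one shard, while a lookup by user_location always touches all of them, no matter how few rows match.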
With Clustrix, every table and index gets its own distribution.
So if you have a schema like this:
foo(a, b, c, d)
unique idx1(b,c)
idx2(d)
Clustrix treats each table and index as a different distribution. So if I need to look something up by d, I know exactly which node has the data. I can also enforce index uniqueness.
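As a rough illustration of the idea (not our actual implementation), here is a toy Python model of the foo schema where the base table, idx1 and idx2 are each hashed on their own key. The node count, the hash function, the class and method names, and the assumption that a is the primary key are all hypothetical.

```python
import hashlib

NODES = 4  # hypothetical cluster size

def node_for(*key_parts):
    """Toy placement function: hash a (distribution name, key...) tuple to a node."""
    digest = hashlib.sha256(repr(key_parts).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES

class Cluster:
    def __init__(self):
        # Three independent distributions, each with one store per node.
        self.base = [dict() for _ in range(NODES)]  # a -> (a, b, c, d)
        self.idx1 = [dict() for _ in range(NODES)]  # (b, c) -> a, unique
        self.idx2 = [dict() for _ in range(NODES)]  # d -> set of a

    def insert(self, a, b, c, d):
        base_node = node_for("foo", a)
        idx1_node = node_for("idx1", b, c)
        # Every row with the same (b, c) hashes to the same node,
        # so uniqueness is a single-node check, not a broadcast.
        if a in self.base[base_node]:
            raise ValueError("duplicate primary key a")
        if (b, c) in self.idx1[idx1_node]:
            raise ValueError("duplicate (b, c) violates unique idx1")
        self.base[base_node][a] = (a, b, c, d)
        self.idx1[idx1_node][(b, c)] = a
        self.idx2[node_for("idx2", d)].setdefault(d, set()).add(a)

    def lookup_by_d(self, d):
        # One node holds all idx2 entries for this d; then fetch rows by a.
        keys = self.idx2[node_for("idx2", d)].get(d, set())
        return sorted(self.base[node_for("foo", a)][a] for a in keys)
```

In this model a lookup by d goes to one index node and then only to the nodes that own the matching rows, and the unique check on (b, c) never has to ask more than one node.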
I saw claims that Clustrix is very good for OLAP applications. Can you shed more light on it? Does it support materialized views, for instance (to speed up calculating aggregations at higher levels)?
We don't support materialized views at the moment, though I can't think of a reason why we couldn't support them if needed. A big strength for our OLAP performance is simply having CPU and memory resources that scale with storage and a query planner that is smart enough to take advantage of those resources.