Hacker News | sergei's comments

What do you base your performance claim vs Clustrix on?


Here is some back-of-the-napkin analysis:

Starting with this benchmark report: http://www.percona.com/files/white-papers/clustrix-tpcc-mysq...

Basically, InfiniSQL does not currently support complex indices, so it can't run a TPC-C-like transaction.

The maximum throughput on 9 nodes is 128,114 per node per second. I don't know if those are 4- or 8-core nodes. If roughly 10% of transactions are multi-node transactions, then that's 12,811/node/s for multi-node and 115,303/node/s for single-node transactions.

I don't know whether Clustrix was configured with full redundancy or a hot spare, so the number of actually usable nodes was likely fewer than 9, but I don't know the precise figure.

InfiniSQL maxed out at over 530,000 multi-node transactions per second on 12 x 4-core nodes. http://www.infinisql.org/blog/2013/1112/benchmarking-infinis...

That's 44,167 per node.

---------

These were not apples-to-apples benchmarks, but Clustrix performed about 12,000 multi-node transactions per node per second, along with a whole bunch more single-node transactions.

I don't know how it would perform on the benchmark I used. And I intend to do a tpcc benchmark once InfiniSQL is capable of complex keys (among whatever else it is currently missing).


Several problems here:

1. Unlike your dataset, the tpcc dataset for the benchmark was not memory resident. Total dataset size was 786G. Just shrinking the dataset to fit in memory would substantially change the numbers.

2. The tpcc workload is much more complex than your benchmark. It doesn't make sense to compare a tpcc transaction, of which there are multiple flavors, to your workload.

3. All Clustrix write transactions are multi-node transactions. There's a 2x data redundancy in the test. We do not have a master-slave model for data distribution. Docs on our data distribution model: http://docs.clustrix.com/display/CLXDOC/Data+Distribution.

-----

Now we've also done tests where the workload is simple point updates and selects on memory-resident datasets. For those kinds of workloads, a 20-node cluster can do over 1M TPS.


A decent cost-based optimizer will be right most of the time, and you have hints for the cases where it falters.

Your approach is simply hiding the query optimizer's choices inside the app, making it more fragile.

Let's say I have two fields: a and b, both of which have an index. And then I have a query with predicates across both a and b. For example, a = 10 and b = 20.

One of the core facilities within an optimizer is selectivity estimation. By looking at the statistics, the optimizer will see that a = 10 might match 10,000 rows while b = 20 might only match 10. So the optimal and desired choice would be to use the index on b.

However, the exact same query construct with different parameters (e.g. a = 50 and b = 3) might flip the index selection.

Now let's imagine I had to implement this inside of my app. Every time I have such a constraint, the app has to become aware of selectivity to know which indexes to use based on input parameters.
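
To make that concrete, here is a minimal, hypothetical sketch of what that selectivity logic looks like once it lives in the app rather than the optimizer (the table, index names, stats helper, and driver API are all made up for illustration):

    # Hypothetical app-side index selection: the app now has to maintain its
    # own statistics and pick an index per call, which is exactly the
    # bookkeeping a cost-based optimizer normally does for you.
    def lookup(conn, a_value, b_value, stats):
        # Estimate how many rows each single-column index would touch.
        est_a = stats.estimated_rows("idx_a", a_value)
        est_b = stats.estimated_rows("idx_b", b_value)

        # Pick the more selective index and force it with a hint.
        index = "idx_a" if est_a < est_b else "idx_b"
        sql = f"SELECT * FROM t USE INDEX ({index}) WHERE a = %s AND b = %s"
        return conn.execute(sql, (a_value, b_value))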


You are underestimating the necessary design tension. If you expose hints, then you've constrained how your optimizer works because it has to work with those hints. (Oracle, for example, has been fighting this battle for ages.) Once you've built an optimizer, people come to depend on it so it is critical that you get it right.

They will need an optimizer eventually, and they know it. But I'm glad that they are not implementing it before they are ready.

Secondly, your selectivity estimation point can go either way, and on the whole I don't like it. I've personally experienced the situation where a database recomputes statistics, the CBO decides that a new query plan is needed for a common query, it makes a poor choice, and the first any human hears about it is when the site is down.

The problem here is that the risk profile, for the application, of trying to be smart is completely backwards. In general, as long as a query is not a bottleneck, I don't care about making it faster. Oh, you made it 2x faster? I didn't mind before, and I'm unlikely to even know that you did so. But if you JUST ONCE switch to a bad plan on production without warning, your users WILL notice, they WILL care, and they WILL NOT be happy.

As a developer, I don't care that you make the right choice 95% of the time. I want you to make a predictable choice. Because if you're making a bad choice in development, I've got a window of opportunity to notice and do something about it where nobody cares. But if you randomly make changes on production, every single mistake counts.

Oh, but you say that this just means you need stored query plans? I agree, and this is an example of why the behavior of the optimizer has to be thought through very carefully before you throw something out there. People come to depend on it, and then you realize that you have put up barriers to evolving it the way you want to.


This accords with our experience with large Oracle databases as well.

We've had a few panics caused by a DBA updating table statistics while trying to optimize some query. This would occasionally cause sudden, massive changes in how other, unrelated queries were executed; those queries would not finish and sometimes brought down the database.

These experiences forced us to change our procedures and keep the test database data closely in sync with the production data, so we'd know how newly gathered statistics would affect queries. The database is large enough that keeping the test environment a full and recent copy of production is pretty painful. Oracle has since introduced features in 11g that allow pinning query plans; we've yet to try these, though.


I wish that databases did "speculative" queries.

In the simplest form, you could say, "I have 2 reasonable plans, let's try A, and if it takes above time X, then start B in parallel and go with whatever finishes first."
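
A minimal sketch of that simple form, assuming hypothetical plan objects with an execute() method (a real engine would also need cancellation and resource limits):

    import concurrent.futures

    def run_speculatively(plan_a, plan_b, params, patience=0.5):
        # Start plan A; if it hasn't finished within the patience window,
        # also start plan B and return whichever result arrives first.
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
            future_a = pool.submit(plan_a.execute, params)
            try:
                return future_a.result(timeout=patience)
            except concurrent.futures.TimeoutError:
                future_b = pool.submit(plan_b.execute, params)
                done, _ = concurrent.futures.wait(
                    [future_a, future_b],
                    return_when=concurrent.futures.FIRST_COMPLETED)
                return done.pop().result()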

You could ramp up the idea to handle changing query plans based on updated statistics by sending some fraction of queries to one plan, and some to another. Then keep stats on how that worked out for you.

Basically never simply flip the switch in production to try a new, unproven query.

Incidentally, Oracle 12c advertises that it will actually collect statistics during query execution and validate the query plan based on those. If that works, then this problem should get better. But of course that comes with overhead, and it is likely to be a complex piece of code, so you tell me whether you trust them to get it right, right away.


I'm sure you'll appreciate the irony of this, but what you described is pretty much the exact mechanism of the MongoDB query optimizer [0].

[0] - http://docs.mongodb.org/manual/core/read-operations/#query-o...


A cost-based optimizer also needs a way to handle the combinatorial explosion of possible plans. For MongoDB, maybe they can be exhaustive or use a simple strategy. But for a SQL DBMS, the search space is way too huge, and you need a way to navigate it intelligently.
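
To give a sense of scale, even counting only left-deep join orderings (ignoring access paths and join algorithms), the space grows factorially with the number of tables:

    import math

    # Left-deep join orderings of an n-table query: n! of them.
    for n in (5, 10, 15):
        print(n, math.factorial(n))
    # 5 120
    # 10 3628800
    # 15 1307674368000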

This "try the plan out" idea has come up many times, and it's a reasonable idea, but it's nothing new and not a complete solution. You need a way to evaluate whether a plan has a reasonable chance of being the best one before the whole plan is even constructed, much less executed.

Combinatorial explosion is also one of the reasons that humans generally aren't great at coming up with non-trivial plans, either.

A good optimizer is a combination of many good ideas; so I don't mean to be discouraging, just that one idea is not the full story.

If you are worried about the risk of a changing plan, you are between a rock and a hard place. If you lock the plan down (or hard-code it in the application), and the data characteristics change, that could be disastrous. If you don't lock it down, that's a leap of faith that the optimizer will adapt gracefully.


I understand the issues you've brought up in your post as well as the traditional methods of pruning the search space of query plans through using left-deep join trees and histograms for calculating selectivity.

My top-level point was that there is no way humans will come up with the optimal query plan by hand, and like you said, even if they do, the data will change.

As an aside, the reason Mongo can use such a simple but effective strategy is because it doesn't support joins, which makes calculating the optimal plan much, much easier. RethinkDB does, however, and as such, a CBO is that much more important.


I am not disputing the value of a cost-based optimizer.

I'm just pointing out that if you have an existing plan that currently seems to be working, the default should be to not switch it up without testing that the switch didn't hurt anything. Instead, you need to use care.


The ironies are indeed rich. Thanks for that!


You can accomplish something pretty similar to this with Oracle plan management: http://docs.oracle.com/cd/B28359_01/server.111/b28274/optpla... You can set this up to pin plans, but have it let you know when it thinks it has a better plan to try out.


Yup. They have been forced into this kind of complexity because the simple, general solution that they think should work (and which does, the vast majority of the time) also routinely causes pants-on-fire emergencies.


DB2 has done that for ages.


I frequently see apps with thousands of distinct query signatures. Having a developer manually choose indexes, join ordering, aggregation methods (just to name a few) for every single query, and then select multiple plans because input parameters absolutely do result in scenarios where plans can be 10,000x off in performance -- well, that's just untenable.

Not trying to claim that CBOs are the panacea here, but let's be realistic. Having developers manually plan every single query is not the right choice.


There is a logical fallacy in what you've said. It absolutely is true that there are times when input parameters can be a factor of 10k in performance. It is also absolutely true that applications can have thousands of distinct query signatures.

What DOES NOT follow is that most of those distinct query signatures are very important to the application. A few are. But most are not. However volunteering to add the wrong one can take the application down.

"Having developers manually plan every single query is not the right choice."

I agree. However the many applications successfully built on MySQL demonstrate that for a lot of people, crappy but consistent planning is good enough. At least then people can know where the problems are, and fix them.


slava @ rethink here. I don't think we disagree -- optimizers are unquestionably a good idea and are immensely useful. For us it was a matter of a) giving people the option to specify indexes directly, and b) shipping quickly. We'll implement a proper statistical optimizer in due time, but for the time being we found that the explicit approach gives a lot of people 80% of what they need with 20% of the work on our part.


An optimizer can be right most of the time, but it's basically impossible for it to be right all of the time. It really depends on what you're doing as to which approach you prefer. If you absolutely need a query to have a certain performance characteristic, then you can't count on the optimizer to get it right; a human needs to think through exactly how they want the query to run to ensure it behaves correctly.

There definitely are several use cases where people are willing to accept a risk of misoptimization in exchange for shorter code, and we fully intend to support that (it's a complicated project, though, so we don't yet). However, I disagree that having this logic in the app makes it more fragile. Having the precise semantics expressed in the application means you always know exactly how a query is going to be performed. We know the execution plan isn't going to change based on seemingly unrelated properties of the data (as it does with optimizers). If fragility is the propensity to break, then I'd argue that having the optimizer in control rather than the developer gives you more fragility, not less.


Which is good, because now your developers don't just add queries to the application without thinking about performance.

You don't have to be a DBA to understand how using one index vs. another will affect the performance - and thus conversion rate - of your application.


In practice, you end up with apps which dynamically build up predicates in different sections of the code. And when you combine that with many predicates, many tables, and other constraints such as ordering or aggregates, things get complex pretty quickly.

Even if you have the best developers who understand all the ins and outs of the dataset, re-implementing an optimizer in the app is rarely the right choice.


No, it's categorically not good to make an application more fragile. A weakness restated is not a strength. Every database should have an optimizer, period.


If a database system does not support joins, aggregations, or subqueries (as is the case for most realtime NoSQL solutions), an optimizer becomes pretty trivial. Optimizers are needed for analytical stuff. That's why most optimizers are evaluated on analytical workloads (e.g. TPC-H, TPC-W), not transactional/realtime ones (TPC-C).


I did not state that an optimizer should not exist for a database - I think that's key, actually - but rather that the tradeoff they made this time around was fundamentally good in that - at least for now - it forces the developer to think about application performance.

If that happens to make an application more fragile, I think that is more of a code organization/tooling issue than anything else.


Check out a previous comment thread on the subject: https://news.ycombinator.com/item?id=5068489


Yes, but they are in-memory only. I think the largest database they support is 50GB.

It's the kind of solution that works really well if you have a key-value store problem where you want really low latencies. Telco call session state and session state for gaming come to mind.

But beyond those use cases, you start running into a lot of architectural limitations...


1) Both joins and aggregates are processed in parallel. In short, the larger your cluster, the faster we evaluate a complex query because we can bring more hardware to the problem.

You can get a better idea of how we scale distributed joins from one of my blog posts:

http://www.clustrix.com/blog/bid/242790/Scaling-Distributed-...

Once we complete the join, we will also perform the aggregate SUM operation across multiple nodes, getting individual sums, and then forwarding the results to another node for a final tally.
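
A minimal sketch of that two-stage aggregation (the partitioning here is hypothetical and simply stands in for joined rows spread across cluster nodes):

    # Each node computes a partial SUM over its local slice of the joined
    # rows; one node then combines the partials into the final tally.
    def local_partial_sum(rows, column):
        return sum(row[column] for row in rows)

    def distributed_sum(partitions, column):
        # In a real cluster the partials run concurrently on separate
        # machines and are shipped over the network for the final step.
        return sum(local_partial_sum(rows, column) for rows in partitions)

    partitions = [
        [{"amount": 10}, {"amount": 5}],   # node 1's slice
        [{"amount": 7}],                   # node 2's slice
        [{"amount": 3}, {"amount": 1}],    # node 3's slice
    ]
    print(distributed_sum(partitions, "amount"))  # 26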

As a rule of thumb, each node has about 70% of the throughput of MySQL on the same hardware. So if you have a complex query that can take advantage of the cluster (100k rows per table examined qualifies), then you will see a 10x speedup on a 16-node cluster over a single instance.

However, that assumes that your entire data set fits in cache on the single instance box. As soon as that's no longer true, it's more likely that you will see a 100x or more speedup from Clustrix because you will have way more memory available for the buffer cache.

For example, if you have a 32GB single instance MySQL server, but you have more than 64GB of working set size, you're going to be limited by IO.

On Clustrix, if you move this to, say, a 4-node cluster, you are going to have an effective cache footprint of 128GB. So Clustrix will operate entirely out of cache.

2) We're fully ACID compliant, using two-phase commit with Paxos for transaction resolution. Because we're MVCC, readers never have to take out locks. To coordinate WRITE access to the same data, we have a row-level distributed lock manager. That means that the entire cluster participates in lock management.
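
For readers who want the mechanics, here is a minimal sketch of plain two-phase commit (the participant interface is hypothetical, and it omits the Paxos layer used for fault-tolerant resolution):

    # Hypothetical sketch of two-phase commit, not Clustrix's implementation.
    def two_phase_commit(participants, txn):
        # Phase 1 (prepare): each participant acquires its locks, durably
        # logs the transaction, and votes yes or no.
        votes = [node.prepare(txn) for node in participants]

        if all(votes):
            # Phase 2 (commit): unanimous yes, so tell everyone to commit.
            for node in participants:
                node.commit(txn)
            return True

        # Any "no" vote (or a timeout treated as "no") aborts it everywhere.
        for node in participants:
            node.abort(txn)
        return False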

So yes, this has higher communication overhead than a single server. If you have a small write workload with very little concurrency, you will see better query latencies from a single-instance database.

However, as soon as you start getting to more serious write workloads with higher concurrency, Clustrix will distribute your writes across several machines in the cluster. We can handle an aggregate write throughput you can't touch with a single instance database, even with a fancy/expensive storage array.

One of our customers, for example, has a peak throughput of 70,000 transactions/sec. They're about 90/10 read/write, so that breaks down to 10,000 TPS for writes. And they are really fond of multi-row updates, insert .. ignore, insert .. on dup update kind of queries. So it's not exactly a lightweight key-value application. Prior to Clustrix, their MySQL installation peaked out at 5,500 TPS total (read+write). They now have a 15 node Clustrix cluster.

It's true, you can always construct a workload that will not distribute well. But these are generally rare degenerate cases that stem from poor application design, and would perform equally poorly on a single instance database.


Very interesting. Thank you!


How do you handle distributed joins with GROUP BYs? Are there certain SQL constructs that aren't handled well? What are the 90th percentile response times?


It's a good question. We built the entire database from the ground up, without any MySQL code. So while we support most of the frequently used features, we don't support everything.

You can get a list of unsupported features and differences from our docs:

http://docs.clustrix.com/display/CLXDOC/Unsupported+Features

http://docs.clustrix.com/display/CLXDOC/Feature+Differences

Having said that, our customers find that the value that they get out of scale out, fault tolerance, and performance is worth the tradeoffs of not supporting every single MySQL feature.

Honestly, our goal is not to be 100% MySQL compatible. We're aiming to be compatible enough to capture a significant share of existing MySQL installs, but we're really after building a truly scalable relational db.


Thanks. In your blog I've found: "The database has native support for dealing with JSON documents."

Why isn't it in the features list? I think it's very cool.


Try it. You'll like it :)


++

Worked for an employer who used it. FAAANTASTIC. I met with the entire executive team out in SFO on Friday. Great company. I have never seen anyone take such care in doing massive QA (including month+ long regression testing on every generation of their platform).


It's different from a typical sharding approach (including what MongoDB does). In their model, you take a single key and distribute your data using that key (e.g. user_id). The problem surfaces when you look at secondary indexes.

If you have a secondary index say on user_location, and you want to query by that index, you don't know which shard to go to. So you end up broadcasting.

Another problem is enforcing unique index constraints.

With Clustrix, every table and index gets its own distribution.

So if you have a schema like this:

    foo(a, b, c, d) unique idx1(b,c) idx2(d)

Clustrix treats each table and index as a different distribution. So if I need to look something up by d, I know exactly which node has the data. I can also enforce index uniqueness.
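
A minimal sketch of that routing idea, with a made-up hash placement (the choice of a as the base table's distribution key is also an assumption; this is not Clustrix's actual scheme):

    # The base table is distributed by one key, and each index gets its own
    # independent distribution keyed on the indexed columns, so a lookup by
    # d hashes d alone and lands on exactly one node instead of
    # broadcasting to all of them.
    NUM_NODES = 9

    def node_for(*key_columns):
        return hash(key_columns) % NUM_NODES

    def placements(a, b, c, d):
        # Writing one row of foo(a, b, c, d) touches three distributions.
        return {
            "foo (base representation, by a)": node_for(a),
            "idx1 (by b, c)": node_for(b, c),
            "idx2 (by d)": node_for(d),
        }

    # A query with "WHERE d = 42" goes straight to node_for(42) for idx2,
    # rather than asking every node whether it has matching rows.
    print(placements(1, "x", "y", 42))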


I saw claims that Clustrix is very good for OLAP applications. Can you shed more light on that? Does it support materialized views, for instance (to speed up calculating aggregations at higher levels)?


We don't support materialized views at the moment, though I can't think of a reason why we couldn't support them if needed. A big strength for our OLAP performance is simply having CPU and memory resources that scale with storage and a query planner that is smart enough to take advantage of those resources.


