Heroku Postgres: SQL Database-as-a-Service (heroku.com)
112 points by tilt on Nov 22, 2011 | 57 comments


I've seen a lot of "Cloud" database services lately, and while I love the general idea, in real life the impact of network latency is a HUGE problem for any serious DB use. So unless this service happens to reside in the same data center as your application, it doesn't seem like a viable option.

The difference between a local DB with <2 ms latency and a remote DB with ~50 ms latency is huge. I've seen application start times go from 1-2 minutes to 20-30 minutes just by pointing at a remote DB.

So if your app is even somewhat DB intensive, you really need <2 ms latency to your DB, and if it's not DB intensive, you probably don't need all the scaling and other infrastructure strengths these services bring to the table.


Pardon me, but if an app's operation time goes from 1-2 minutes to 20-30 when you change from a local to a remote DB, then you've either got an utterly abysmal network or far too many round trips to the database, and the app could benefit hugely from being rewritten to move critical functions into SQL running directly on the database server.

Now, I'd still rather have a local database, but if you design the app properly, almost all pages in your typical webapp shouldn't require more than a single round trip. IMHO.


FWIW: the remote DB was ~50 ms away. Large-scale eCommerce applications tend to be DB-heavy. Take a complex catalog structure with a few million SKUs, add in a few tens of millions each of users, orders, and coupons, push thousands of orders an hour, and you have a LOT of DB traffic. Add in things like real-time inventory checks on product pages, dynamic shipping option/cost calculations based on the user's address (if logged in) or IP address (if not), determining applicable coupons, and cross-sells, up-sells, and CTAs chosen dynamically from the user's history and profile data, and you make a lot of DB calls. Even caching things on the app tier, you have to load that data from the DB at least once. Again, this all depends on your application and needs, but saying you should rewrite it to move more logic into the DB isn't really useful in many scenarios.

If you need things like dynamic cloud scaling, read only slave replication, etc... chances are you're doing a lot of DB transactions as well, and that latency can and will kill you.


No reason you can't do an inventory check in the same call that returns the basic catalogue details; it won't be any more accurate in a separate call 50 ms later. If you're generating the page for a logged-in user, you know who that user is and what they're asking for, so you can calculate coupons and the like in the same call too, without resending the same parameters to the database to look up; if they're not logged in, feed in their IP address as a parameter instead.
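As a rough sketch (hypothetical product/inventory/coupon tables, with 42 standing in for the logged-in user's id), something like this folds the inventory check and coupon lookup into the catalogue query itself:

  -- one round trip: catalogue details, live stock, and any applicable coupon
  select p.id, p.name, p.price,
         inv.quantity_on_hand,
         c.code as coupon_code
  from product p
  join inventory inv on inv.product_id = p.id
  left join coupon c on c.product_id = p.id
                    and c.user_id = 42
  where p.id = 1001;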

Now, I'm sure there are all sorts of other cases in apps beyond my experience where you're better off making multiple calls to the database, but certainly nothing you've outlined couldn't be handled by a decent database infrastructure in a single call returning (potentially) multiple result sets.

One thing I learnt many years ago working on data-intensive apps: communications latency between your app and your database will kill performance if you let it, and shipping data back and forth repeatedly will hang, draw, and quarter it. Absolutely, aggressively pare the number of external calls you make down to the bone and you'll see a very significant performance boost.


Still, you are talking about round-trip overhead.

As a counter-example, LedgerSMB is very database-intensive, but we do what we can to make sure round trips are minimized. This means making sure that everything that needs to be queried together is queried together, in the same query.

I suppose if you are doing a lot with ORMs and the like, though, that may not be an option.


I think this is enormously exacerbated by round-tripping. Even if the DB is just 2 ms away and you go back and forth a few dozen times, you are doing a ton of waiting: a few dozen trips at 2 ms is on the order of 100 ms of pure latency, and at 50 ms it's over a second per page.

This is one of the problems with joinless NoSQL databases as well, since you HAVE to round-trip to join.


You don't have to. For instance, in CouchDB, you can use view collation to retrieve related documents in a single request.


That sounds like a join by another name :)


If you use the term loosely, I suppose so. :)

But you're not actually joining any data. You return one document, and then exploit the properties of sorting to return the related document(s) next in the collection. In SQL-speak, each relation comes back as a distinct row, and the rows are only connected by the order in which they are returned. There is a writeup on the specifics here, if you are curious: http://wiki.apache.org/couchdb/View_collation

It is a pretty clever trick that can save you a round trip to the DB, but doesn't have nearly the power that a real join does.


You can do the same in PostgreSQL with joins.

For example:

  select i.*, array_agg(ROW(l.*)::text)
  from invoice i
  join invoice_line l ON i.id = l.invoice_id
  where i.id = 1023
  group by i.id;  -- needed: array_agg is an aggregate

This returns a single row for the invoice, along with an array-of-tuples representation of all its invoice lines. You could also use casts and stored-procedure languages to convert to XML or whatever you want.


Given that these databases run in AWS, there's no reason you can't run your application in the same datacenter.


Well, except that AWS doesn't let you coordinate AZs between customers [1]

[1] http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_a...


Maybe not officially, but I believe "is in the same AZ" is an equivalence relation, so once you've figured out the appropriate permutation function...


I get 1 ms or lower latency between AZs.


I get about 1.5ms to 1.6ms ping latency between AZs and about .3ms to .4ms between nodes in the same AZ (west-1).


I think you are correct.

I'd guess that the major target of this would be applications already running on AWS (which would put them mostly in the same datacenter). Obviously, a bunch of caveats about availability zones, etc.


Well, Postgres can shine here if you can do a fair amount of processing inside a stored procedure. The sproc takes care of the heavy lifting, the client can be (fairly) limited in the transformations it applies, and the network latency of repeatedly going to the DB is lessened.

This is, naturally, very application-dependent.
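As a sketch of the idea (hypothetical orders/order_lines tables, nothing to do with Heroku's own schema), a function like this bundles the lookups and aggregation a client might otherwise do over several round trips:

  -- one call returns the order id plus aggregates over its lines
  create or replace function order_summary(p_order_id int)
  returns table(order_id int, line_count bigint, order_total numeric) as $$
  begin
    return query
      select o.id, count(l.id), sum(l.quantity * l.unit_price)
      from orders o
      join order_lines l on l.order_id = o.id
      where o.id = p_order_id
      group by o.id;
  end;
  $$ language plpgsql stable;

The client then makes a single call: select * from order_summary(1023);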


Not just Postgres; any database server in which you can either use stored procedures or submit arbitrary batches of SQL statements, rather than just individual queries, will let you do this. I routinely do this in MS SQL Server, for example, and have worked on multiple projects where the ability to do it was utterly critical to overall performance.

To be perfectly honest, I'd consider any database server where I couldn't do this to be something of a toy because of the restrictions it places on overall app performance. It might be OK for SQLite or Access, but a real database? Sorry, no, come back when you've finished the thing please.


Indeed, you can do a fair amount of processing in Pgsql without even hitting stored procs. Plain SQL gives you tremendous power in Pg. I find stored procedures add useful semantic sugar around it, but especially in newer versions, things like common table expressions and arrays of tuples give you a great deal of expressiveness on their own.
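For instance (hypothetical customers/orders tables again), one CTE-based statement can stage intermediate results that would otherwise be separate queries:

  -- one statement: per-customer totals, then filter to the big spenders
  with order_totals as (
    select customer_id, sum(total) as lifetime_total
    from orders
    group by customer_id
  )
  select c.name, t.lifetime_total
  from customers c
  join order_totals t on t.customer_id = c.id
  where t.lifetime_total > 1000;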


I tend to agree with this sentiment, though for some read-heavy applications you can mitigate this problem via a distributed cache layer in your web tier.


Disclaimer: I own Cloud Postgres

Soon to be released: https://www.cloudpostgres.com/

- Lower prices

- PostgreSQL 9.1

- Optional PGPool in front of the replicas for load balancing

- Multiple datacenters


I think this is amazing, and I immediately had a project in mind for it. But then I looked at their pricing: $200/mo at the lowest price point.

It's still incredibly affordable, but not for what I had in mind. I wish they had lower levels so I could play.


I'm guessing that they will shortly, as it seems like a big market hole to plug. I'd bet that since the lowest-priced dedicated DB for a Heroku app is $200, they've just moved that functionality over.

There's probably some complexity in getting shared cheaper DBs available that we don't know about.


I wonder if they have any "special sauce" in running PostgreSQL on EC2, or if the real value is in the easy provisioning? We're running PG 9.0 on EC2 now via EBS, and everything I've read says it's tough to get it performant - even with large volumes and RAID 10, you still get spikes when your virtualization neighbors slam the disk.


Someone at Heroku wrote a very useful article a few years ago on getting around the pain points of EBS, using RAID on EBS, etc. (This is probably the article I remembered: http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs...)


I/O performance is still abysmal even on their largest instance type when striping over 20 EBS volumes (I benchmarked that a while back). And we still noticed a high variance in I/O performance during peak hours despite the striping.

Any recent dedicated server with 16 spindles will completely annihilate EBS in terms of IOPS. And that's with plain old spinning rust - add some SSDs and EBS ain't even in the same ballpark anymore.


Considering their "Continuous Protection" feature, could they be streaming a write log out to S3? I think that way they could use the backup as the sole guarantee of durability, and use instance storage for the database.

EDIT: Also, I'm not sure if this is generally known, but might they not be on EC2 at all? They are advertising a 1.7 GB cache, which is the same as a small instance, except without any room for the operating system or Postgres itself.


You're exactly right about log streaming to S3. We open-sourced the tool that we use to perform that: https://github.com/heroku/wal-e
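(For anyone curious how it hooks in: per the WAL-E README, it rides on Postgres's normal WAL archiving, with settings in postgresql.conf along these lines, where the envdir directory holds your AWS credentials:

  wal_level = archive
  archive_mode = on
  archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
  archive_timeout = 60

Base backups are then taken with wal-e backup-push.)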


S3 doesn't support continual logging - you need to decide on a file size up-front. Are you creating a new object for each log entry, or is there a window in which commits can be lost before the log gets flushed out, or do you have some other workaround...?


There's a window of a couple minutes in which data could be lost (in the worst case scenario).


So it's postgres log shipping, with S3 as the target?


Yup.


So would I need to (learn and) use PG9's replication with this? Or is one ephemeral-stored database plus shipped logs sufficient, assuming I can live with some downtime if the instance disappears?

As an experienced techie but a PG novice, what do I need to go learn about PG recovery procedures to safely use this?


I wouldn't use EBS. Instance storage is much more consistent, in terms of latency and performance.


I didn't realize that; in fact, I thought I saw something that claimed the opposite. It'd be great to see someone write up a HOWTO for WAL-E and PG9 replication on ephemeral stores in EC2.


I wish that they also sold their minimal shared 20GB $15/month database plan as a separate service (i.e., being able to access it from your own EC2 instances and not just from Heroku deployed apps).

I have a use case for a low volume app that I would like to host on Heroku, but have other software also have access to the database, without resorting to the obvious workarounds.


Cheapest plan: $200/month. That seems a bit high... what competes with this?


It's some money but I'm definitely interested. Scroll down and you see some interesting things:

- forking of databases (looks great for trying out migrations easily)

- read-only asynchronous replicas

- connection from EC2 or elsewhere

- health checks and "continuous protection"

Looks fairly solid at first sight.


   what competes with this?

http://xeround.com is a separate DB service, also available as a Heroku plugin. It is MySQL-compatible; apparently they've written a custom distributed engine. It is hosted on AWS or Rackspace, so latency is quite low.

Since Xeround is pay-as-you-go, if your initial DB requirements are low, you can run it on something like http://fluxflex.com (a Heroku-like PaaS, also AWS-based) and have 10 DB-backed apps for something like $3 per month, scaling up when needed.


When you think about all the setup, sysadmin work, backups, scaling, etc. you'll avoid, $200 a month ($2.4k/year) is pretty cheap to be able to switch on a scalable DB at a moment's notice.


$200 is cheap, but I'm hoping they eventually put out an option closer to the $40/mo price point just so I can play around with Postgres (I'm currently an MS SQL Server guy who wants to start branching out). I don't want to spin up a new AWS instance and waste time on configuration, and I don't want to worry about hosing a currently running instance just to have a new sandbox to play in. I'd pay a reasonable amount for something akin to a cheap "test" instance in exchange for giving up things like an SLA or dedicated support.


If you want to just play around with it, why not download and install it? No charge for that.....


Their shared DB hosting is free for 5 MB and $15/mo for 20 GB. It's not going to have all the fancy features of the dedicated plans (fork, follow, etc.), but it is Postgres at least.


http://aws.amazon.com/rds/

I believe Heroku used EC2 under the hood before the Salesforce acquisition.

Whether Postgres (vs. MySQL) plus the polish and better UI/workflow of Heroku is worth the premium over RDS depends on you.


Heroku dynos still run on EC2 instances, exclusively on the East Coast of the US.

I'm not sure if these Postgres servers do. However, one interesting note: the "maximum database size" of the Heroku server is 2TB, while RDS's is 1TB.


Seems high compared to what?


Seems high compared to the cost of a comparable EC2 node. Of course, you avoid the administrative costs, but it's pretty expensive for an early-stage startup. Possibly a good deal for a larger company, but at that point you might prefer to just configure the EC2 nodes, backups, etc. yourself and save a good bit of money every month.


On the other hand, $200 is a common price for two hours of Rails consulting these days. So as a developer bootstrapping a product, I'd rather pay for this and sleep well (one of their bullet points).

I'd be curious to know how reliable it is in practice, compared to a home-grown EC2 setup.


Plus if you need a larger instance like a 7.5GB RAM instance, it's $800/month on Heroku vs $75/month for a reserved EC2 instance with comparable RAM. That's an astounding difference in price for cloud storage!

Of course, the real competition is other data-storage services like Amazon RDS, MongoHQ, MongoLab, etc. (all much cheaper)

(OK, so I excluded EBS costs from my EC2 comparison above but at $0.10/GB-month it's still going to be a huge difference for most cases)


Your numbers are a bit off.

It's more like 162 bucks a month for a comparable reserved EC2 instance. And 208 a month for a comparable reserved RDS instance. And that's of course with a 12 month commitment.

You can't really compare EC2 to an offering like this because you're paying for the amount of time / labor costs it frees up. And of course RDS doesn't support Postgres so there's not a lot of competition at the moment.


Pricing things to be affordable for early-stage web startups seems like a good way to go out of business.

$200 per month for a database server is ridiculously cheap for a large company.


I think it is likely that Amazon will do a Postgres RDS soon too.


I don't know about you guys but one of the cheapest optimizations you can do is making sure your database and application server are "close" to each other.

Does a cloud database service make sense? Maybe. Databases have gotten easier and easier to set up, and they are quite optimized in terms of processing power. I'm not sure I want to give up control of my database just yet.


Hmm, it seems expensive. For this price I would choose Oracle Database on RDS; a small Oracle DB instance is ~$120/month on AWS.


Why PostgreSQL instead of MySQL? What are the advantages of Postgres over MySQL in a DBaaS implementation?

I've seen some Database-as-a-Service projects, and all of them choose PostgreSQL.


Probably because, for the sorts of applications DBaaS implementations target, PostgreSQL dominates the market among open source solutions.

MySQL is very popular for web sites, but as a database serving multiple applications from the same DB, PostgreSQL has been the de facto choice in the open source world for as long as I have been working with DBs (at least the last 12 years).


Licensing might be an issue.



