Power Failure Testing with SSDs (nordeus.com)
86 points by kustodian on Nov 12, 2015 | 42 comments


I can see that the author put considerable time into testing and writing the article, but nobody in the business of deploying production PostgreSQL should ever have been using the various models of desktop SSD that were tested. This is because the subject of which SSDs are reliable has come up repeatedly on the PG mailing lists. Several heavy PG users and developers have already done this testing, and the universally accepted advice in that forum, for years, has been to never ever use a drive that lacks power-fail protection.

Disclosure: we run PostgreSQL in production on Intel and Samsung "data center grade" SSDs, and I participated in the aforementioned PG mailing list discussions.

Updated: from http://www.postgresql.org/docs/9.4/static/wal-reliability.ht...

"Many solid-state drives (SSD) also have volatile write-back caches."

and this thread: http://www.postgresql.org/message-id/533F405F.2050106@benjam...

and https://news.ycombinator.com/item?id=6973179


"...but nobody in the business of deploying production Posgresql should ever have been using the various models of desktop SSD that were tested." this is one of the reasons why this article was written. To show you how to test, to see what you should use and what you shouldn't in production. When we did these tests we already knew we shouldn't use those SSDs, we mostly did it just for the sake of testing, to see if the tests we wanted to use make any sense and to show others how to do it.

Also, I know it may sound funny to use these SSDs in production, but when we started with our game Top Eleven we didn't know that. When you have 5 people working on a game, you don't have time (or resources) to think about the type of SSD: you order a server with an SSD, you get an SSD (you usually don't even have a choice when you rent), and you use it. This is how it still works with almost any server rental company.


Ok, you didn't know something that's common knowledge in the industry - that's reasonable. I'm sure there are plenty of things I don't know that I should. But there are only three people in our company and we spend quite a bit of time worrying about SSD reliability because our business depends on it :)


Testing beats reasoning from specs.


I don't know what type of business you're in, but in gaming, if one server breaks it's not the end of the world; nobody is going to die or lose money over it :P


Am I the only one who thinks RAID controllers are a placebo and wouldn't trust anything but ZFS?

What was even more interesting is that our SSDs are connected to the Dell H700/H710 RAID controller, which has a battery backup unit (BBU) that should make our drives power-failure resilient. In case of a power failure, a RAID controller with a BBU can hold the cached data until the power comes back, and then flush it to the drives when they come back online.


You're both right and wrong.

RAID controllers that you could afford for yourself, costing less than $1000, are just another point of failure. I've seen them fail more often than the disks, causing corruption as they went. However I've also seen the very expensive datacenter RAID controllers keep a whole bunch of servers up for years as their (15k rpm spinning rust) disks were failing and being swapped.

SSD problems with power failure are such that ZFS won't save you. ZFS's checksumming will at least know what's corrupted and what isn't. But the extent of the corruption will be practically unlimited if the OS can't trust the drive to honor write barriers / flushes - any byte since the last powerup could still be only in volatile cache on the SSD and never make it to persistent NAND.

Every filesystem depends on these occasional flushes/barriers to establish checkpoints where all previous writes are really written, including ZFS. Consider, it may have created an updated copy of the filesystem tree root node, made sure it was flushed, made another updated copy, made sure it was flushed, and then (only after the flushes) re-used the space that the original copy occupied for other data. If you can't trust that the flushes actually happened when the drive indicated they were completed, then it might be that only the last write, over the original root node copy, actually made it to the NAND.
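To make that concrete, here is a minimal sketch (in Python, with a hypothetical path) of the durability contract every filesystem and database relies on. fsync() only provides durability if the drive actually honors the cache-flush command it is sent:

    import os

    def durable_write(path, data):
        """Write data and ask the OS to push it all the way to stable storage."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            # fsync() flushes the kernel page cache and issues a cache-flush
            # (write barrier) to the drive. If the drive acknowledges that
            # flush while the data is still only in its volatile cache, the
            # durability guarantee here is silently void.
            os.fsync(fd)
        finally:
            os.close(fd)

    durable_write("/mnt/ssd1/wal-segment", b"committed transaction record\n")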


I'm not sure that you could say that RAID controllers are a placebo - I've been through multiple hard disk failures that either hardware or software RAID has enabled a graceful recovery from (with no data loss).

As for whether or not it's superior to ZFS, though, that's a tricky question. High-end hardware RAID gets you a lot of neat features that you don't find elsewhere, but it's expensive and the RAID controller itself tends to become more and more of a point of failure (always keep spares, especially if they go out of manufacture!)


You and the author of the article are confusing cache layers and "battery protection." A RAID BBU will only protect the RAID controller's write cache; the thing your write IO sits in until the controller flushes it to the disk set. That's either 4-5 seconds, depending on the Dell/LSI model, or until it gets pushed out for cache capacity.

The above has absolutely zero to do with the drive write cache and power protection. The drive write cache is used to cache writes on the drive itself, after the RAID controller. Spinning metal drives have caches. SSDs have caches. Whether you use the cache is up to you and your OS. The safest setting is off. You can usually get more performance by leaving it on. This is why Windows would throw warnings all over the place when you enable write caching without a connected UPS. I'm not sure exactly what the logic is on Linux for sd/sg devices.
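(For what it's worth, on Linux the on-drive write cache can be inspected and toggled with hdparm; a rough sketch below, with a hypothetical device name. Drives sitting behind a RAID controller usually have to be configured through the controller's own tool instead.)

    import subprocess

    DEVICE = "/dev/sdb"  # hypothetical device; requires root and hdparm installed

    # Show the current on-drive write cache setting.
    subprocess.run(["hdparm", "-W", DEVICE], check=True)

    # Disable the on-drive write cache: the safest setting for drives without
    # power loss protection, at a (sometimes large) cost in write performance.
    subprocess.run(["hdparm", "-W0", DEVICE], check=True)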

The default Dell RAID controller behavior with write caching on drives is based on drive interface. SAS = off. SATA = on. If you run non-power safe SATA drives with a Dell RAID controller, you must disable the write cache if you want data durability.

Then again, you could also run your databases on power-safe, non-consumer SSDs. This whole article was basically a giant warning sign saying "run away, they need serious help."


I worked in Dell's high-complexity support for a couple of years. I saw RAID disk failures all day long. I BELIEVE SOFTWARE RAID IS YOUR FRIEND. I've seen hardware RAID systems become corrupted because a minor version of the controller software was used after a failure. Except for multiple platter crashes, with software RAID (Linux) and SpinRite there has never been a RAID I couldn't recover. So... hardware RAID is NOT worth the extra speed.


>Am I the only one who thinks RAID controllers are a placebo and wouldn't trust anything but ZFS?

No, there are others who cargo-cult believe in ZFS too

-- a hyped single-vendor OS technology, without first-tier support on Linux, and with its own issues, compared to an industry-wide standard protocol and its implementations, used for three decades in the most demanding data centers.


> a hyped single vendor

Who's that single vendor? FreeBSD, OpenIndiana, Joyent, Oracle?

> compared to an industry-wide standard

Good luck exchanging your fried RAID controller for a "comparable" model


> Good luck exchanging your fried RAID controller for a "comparable" model

If you are running Linux/*BSD/SmartOS, the best RAID controller is a JBOD controller, i.e. one that exposes all connected disks as-is to the host OS, and then the in-kernel soft-RAID implements the actual logic.

This approach gives you a more predictable system with no vendor-specific idiosyncrasies or extra cache level to worry about (the BBU in OP's article); you end up running RAID code that is peer-reviewed and fully integrated into the fsync() path, and, surprisingly, it often gives you better performance too.
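As a sketch of what that looks like in practice (not the poster's actual setup; device names are hypothetical), building an in-kernel md mirror over disks a JBOD controller exposes directly:

    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Assemble a kernel md RAID-1 mirror from two directly exposed disks,
    # put a filesystem on it, and mount it. Requires root, mdadm, xfsprogs.
    run(["mdadm", "--create", "/dev/md0", "--level=1",
         "--raid-devices=2", "/dev/sdb", "/dev/sdc"])
    run(["mkfs.xfs", "/dev/md0"])
    run(["mount", "/dev/md0", "/mnt/data"])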

"True hardware RAID controllers", on the other hand, are nothing more than an application-specific computer with its own (non-upgradeable and often outdated) CPU, RAM, I/O and hard-to-upgrade proprietary software.

And if you buy into the view described above, then replacing a JBOD controller for RAID use is exactly the same thing as replacing a JBOD controller for ZFS use, by definition.


I currently deal with a few disk failures every week, all on RAID6 (on Dell H7xx controllers) and a few RAID-1 arrays on HP and Adaptec controllers. Mostly SAS drives, but a few older model SATA SSD drives (that are dropping like flies after lots of writes over a year). This is across about 32000 disks, so the chances of some disk failure every week are high. The RAID controllers work as advertised almost always, file system intact. Sometimes there's a performance degradation during a disk rebuild, but only on arrays where disk I/O is near max capacity. There have been, I think, two catastrophic failures, for example when there was an issue with the cable or backplane causing corruption, and then more than two disks went corrupt before we got to the bottom of it, invalidating the entire array. That said, we have very few issues with power, and the batteries have died on many of the controllers, so I'm not sure how it would pan out with dodgy data centre power flapping.


Seems pretty inexcusable to release SSDs that get corrupted on power off.

These drives are defective, and they should refund customers the SSD price, plus some hefty compensation for the data loss they have caused or could have caused.

If you make a storage system and ack a write/sync, it had better be durably written.


Most non-enterprise SSDs do not have an internal supercap or other power protection mechanism; they are not intended for a server use case and shouldn't be used in such a capacity.

An HDD will not hold on to data that is still in its write cache either, so the SSDs are well within spec.


That is not at all what the blog post is claiming nor is it the reality. It's also not what the commenter you're responding to is complaining about.

The complaint is not about losing data in a volatile cache; the complaint is that drives will lose data even after the drive has claimed to have flushed its write cache after being given a write barrier.


These are old (320-520) SSDs. We are using the Intel 3500 & 3700 DC series and haven't had any issues, even with a few power outages in the past few years.

The 840 Pro is a consumer-grade drive with a lower guaranteed maximum write capacity and larger unit-to-unit variance in quality. You should use server-grade disks instead: http://www.samsung.com/global/business/semiconductor/minisit... We use these in our analytics servers; they have stood up to the test of time many times, without any issues. SSD tech evolves so fast that 3 years since the release of a model lineup seems like forever, especially in a write-intensive environment (for example, writing slows reading, but by how much? DC-level drives are way ahead here and very consistent, as per what we found out in our tryouts).


Yeah, I know these are old drives; we did this testing about a year and a half ago, and when we started renting servers more than 5 years ago we didn't think about which SSDs were in there. At that time we didn't have the option to choose, nor the time to think about the implications of different SSD models; we used what we got. Later on, when we grew and started having problems, we started investigating which models we should use. The main reason for writing this article is to make people aware that they should test their drives, and to show them how to do it.
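In that spirit, here is a minimal sketch of the kind of consistency check such a power-pull test performs (an illustration of the principle, not the article's actual script; the path is hypothetical). The writer appends checksummed, sequence-numbered records and fsyncs each one, printing the last acknowledged sequence number; you cut power mid-run, and after reboot the verifier confirms that every acknowledged record survived intact:

    import os, struct, sys, zlib

    PATH = "/mnt/ssd1/powerfail.test"   # filesystem on the drive under test
    RECORD = struct.Struct("<IQI")      # magic, sequence number, CRC32 of both

    def writer():
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        seq = 0
        while True:
            body = struct.pack("<IQ", 0xC0FFEE, seq)
            os.write(fd, body + struct.pack("<I", zlib.crc32(body)))
            os.fsync(fd)                      # the drive has claimed this is durable
            print("acked", seq, flush=True)   # note the last number before pulling power
            seq += 1

    def verifier(last_acked):
        with open(PATH, "rb") as f:
            data = f.read()
        ok = -1
        for off in range(0, len(data) - RECORD.size + 1, RECORD.size):
            magic, seq, crc = RECORD.unpack_from(data, off)
            if magic != 0xC0FFEE or zlib.crc32(data[off:off + 12]) != crc:
                break
            ok = seq
        print("PASS" if ok >= last_acked else "FAIL: acknowledged writes were lost")

    if __name__ == "__main__":
        verifier(int(sys.argv[2])) if sys.argv[1] == "verify" else writer()

A power-safe drive should never fail this: anything it acknowledged via fsync() before the plug was pulled must still be there after reboot.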


What's the decision behind not testing Samsung with "On On" cache and barriers? Is it so much slower that it's not worth testing? Shouldn't barriers allow the cache on the disk to "know better" how to organize writes and still be faster than without the cache turned on?

The "disk cache" is a disk hardware option (how it uses its own RAM), if I understood, and the barriers are just an option of the FS behavior (software). I'd expect that the performance penalty to the former is much higher than to the later?


The reason is that the performance penalty for barriers On + disk cache On was much higher than for barriers Off + disk cache Off, at least that is how it looked on our testing system. Of course, that could be due to the fact that we are using a RAID controller which has 1GB of cache. If you are not using RAID, "On On" would be a valid test.
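For anyone who wants to reproduce that comparison, the measurement itself is simple; a rough sketch (in the same spirit as PostgreSQL's pg_test_fsync, with an arbitrary path and count) that you run once per configuration, e.g. after remounting with barriers on or off and toggling the drive cache:

    import os, time

    def fsyncs_per_second(path, n=2000, block=b"x" * 8192):
        """Append-and-fsync n small blocks and report the sustained fsync rate."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            start = time.time()
            for _ in range(n):
                os.write(fd, block)
                os.fsync(fd)
            return n / (time.time() - start)
        finally:
            os.close(fd)

    print("%.0f fsyncs/sec" % fsyncs_per_second("/mnt/ssd1/fsync.bench"))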


Can I just get a decent-sized capacitor on the power rail to the SSD to provide the half second of power needed to flush the cached state and avoid these problems? What's the capacitor doing in enterprise drives that triples the price?


In theory, yes. But in practice the drive would still see power applied and would not trigger the 'dump the cache' firmware until it was too late. You need a little bit of extra circuitry to warn in time before the power gets too low.

Usually it works like this: an external voltage powers the drive, and inside the drive this voltage is split into two circuit paths. One drives a voltage regulator with a large capacitor on the far side of it; the other drives (usually accompanied by a pull-down resistor) an input of the CPU in the drive. If that input goes below a specified value, the CPU knows the power is about to fail and will trigger the cache write to the persistent media.

Obviously that only holds for the data already present in the memory of the CPU on the drive; the rest of the computer will have to fend for itself.

So just adding a capacitor on the outside will keep the drive powered up for half a second longer but won't initiate a cache dump (and that's assuming you're not going to end up powering the rest of the circuitry as well; you'd need to add a diode or something to avoid backfeeding the circuit that charged the cap in the first place).


I have developed power loss protection for SSDs in the past. You also need to give an early warning "power failed" signal to the controller to dump its internal buffers and state.

Power loss can be very challenging to get robust, especially if the controller uses clever algorithms at runtime to get better performance, because recovering the drive state after a sudden power loss is then more difficult. That is why I think the Intel 520 controller has so many problems: it uses a SandForce controller, which was known to use no external DRAM and compression algorithms, which just complicates things.


There is nothing that makes SSDs with power loss protection inherently expensive. As always, it's just market segmentation...

A large supercap in the 100-200mF range for an SSD is around $1 or $2. In fact you can implement power loss protection with less capacitance with regular tantalum caps like the Intel 320 did (http://www.storagereview.com/intel_ssd_320_review_300gb). But drive manufacturers see the consumer market doesn't care about power loss protection, so they decide to scrap the feature, which saves a buck or two, and saves some PCB space.


Which is odd, because in my world SSDs in laptops will lose power, but SSDs in servers are on a UPS, so they will be shut down gracefully.


A laptop running down its battery should shut itself off gracefully. That won't trigger the SSD power loss problems.

It isn't a clean OS shutdown usually, but an orderly transition to a hibernation state of some variety which should include flushing drives.


That's true until your battery ages a bit and the on-board battery monitoring becomes inaccurate, because it thinks there's 5% or so of power left when there is none. I have more than one laptop that will just let the battery run out because the on-board diagnostics think there's juice left when there isn't any.

I think if you're selling laptops, then you should worry about these types of cases. Not to mention cases like having Windows Updates run on battery, which means the laptop can't hibernate after it has started these installs at shutdown.

Standby/hibernate is still far from perfect. A fifty-cent capacitor shouldn't be a dealbreaker for SSD manufacturers.


People who sell laptops want you to go and buy new laptops, not just replace the battery. Why design around failed components? The answer you want is to replace the battery, not fork out for an expensive SSD option that's unnecessary for 100% of the design life of the product.


It is not just power outages that can corrupt SSDs; it is any unexpected power loss without the preceding SATA commands, so I think this would include unexpected OS reboots, kernel panics, etc. [1]

I've recently bought a new SSD and was searching for information on power loss protection, and the only vendor documentation on the matter seems to be [1] and [2]. SSD reviews have plenty of performance numbers (which interest me less), and besides sometimes describing what the vendor says about power loss protection, they don't perform any actual testing for it, and sometimes end up being fooled by the vendors [3]. The only actual test I found was [4].

Capacitors are one way of protecting against this, but for some reason even some of the newer enterprise SSDs sometimes have them and sometimes not, even when older versions had them. Some SSDs claim to use journaling (on SLC NAND) instead, but given that the firmware is closed source there is no way to inspect it for bugs.

[1] "Storage devices require a graceful removal of power to ensure data integrity is preserved. Graceful removal of power includes commands to signal to the storage device that power might be imminently removed" http://www.sandisk.com/Assets/docs/Unexpected_Power_Loss_Pro...

[2] http://www.intel.com/content/dam/www/public/us/en/documents/...

[3] http://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1...

[4] http://lkcl.net/reports/ssd_analysis.html


The capacitor itself doesn't significantly increase the price, but there are more differences than just a capacitor in the datacenter drives. Look at the power consumption for a start: DC ones use much more power at idle. Also look at the technology behind them: DC variants have to be more conservative with the stress on the cells (fewer bits per cell, potentially fewer cells per area, potentially more spare cells to replace dead ones, etc.).

And the market segmentation is the natural way to go. The different needs and different sensitivity to price determine the price.


You need a trigger as well, so that when the power rail voltage drops, the firmware dumps its cache. With a substantial write-back cache, the capacitor or battery needs to be sized large enough to write out potentially hundreds of megabytes.


Supercaps are themselves expensive, bulky, and may not be reflow-compatible.


And yet again the takeaway is "buy Intel datacentre quality SSDs".


Indeed. See Luke Leighton's excellent "Analysis of SSD reliability during power outages" at http://lkcl.net/reports/ssd_analysis.html which came to the same conclusion some years ago.


It would be interesting to see them use Samsung SM843, which would be more comparable to Intel S3500.


We used Intel DC drives exclusively until one of them mysteriously failed, thereafter we used Samsung 845DC for new builds. So far they've performed well.


We use both; none of them have failed us.


This is why I've been deploying Crucial M500, M550, and MX200 drives over the years. They don't use full-scale supercap protection, but they're better than the Samsung Pros and the few Intel drives without proper protection.

Generally power loss will not affect these drives, and I couldn't get them to scramble existing data or damage the drive while unplugging them, or unplugging the computer they were in, during heavy writes.

The M600DC is Crucial's full-scale DC model that offers superior power loss protection.

Crucial drives are manufactured at the joint Intel/Micron facility (using technology from both companies) where Intel's current lineup of drives is also manufactured.

I agree with the article that S3500s have sufficient protection.

SanDisk also has a power loss protected drive, but the drives themselves don't seem to be any good. I'm hoping SanDisk drives produced under Western Digital's ownership will be much better.


Anyone notice the theme in the drives that are not power-safe? They all use SandForce controllers.


What was with these "barrier" things? What was actually turned on and off? Which filesystem exactly?


> CentOS 6.5 was used, SSDs were formatted with XFS and they were mounted into /mnt/ssd1 and /mnt/ssd2. XFS was used because that is our main file system for databases.

Write barriers are a mount option [1] [2].

[1] http://linux.die.net/man/8/mount

[2] http://lwn.net/Articles/283161/
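Concretely, on an XFS setup of that era the toggle is just the barrier/nobarrier mount option; a hedged sketch of the two configurations (the device name is hypothetical, only the mount point comes from the article):

    import subprocess

    DEV, MNT = "/dev/sdb1", "/mnt/ssd1"   # hypothetical device, article's mount point

    # Barriers enabled (the XFS default at the time): the filesystem issues
    # cache-flush/FUA requests so journal commits reach stable storage in order.
    subprocess.run(["mount", "-o", "barrier", DEV, MNT], check=True)

    # Barriers disabled: faster, but only safe when the cache below the
    # filesystem is non-volatile (BBU-backed controller cache, power-safe SSD).
    # subprocess.run(["mount", "-o", "nobarrier", DEV, MNT], check=True)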



