Since when did SSDs need water cooling? (theregister.com)
75 points by Bender on May 27, 2023 | hide | past | favorite | 72 comments


This author is confused, or the article is just badly written. The thing that draws all the power in these newer SSDs is the controller, not the memory. Shoveling 2 million IOPS to the host CPU is a difficult task … your high-power host CPU can barely keep up. But the article goes on and on about the flash and hardly mentions the controller.


> This author is confused, or the article is just badly written.

I see you have discovered The Register. Welcome!


Because the controller and flash sit on the same physical device in a very small space, they are at least somewhat thermally coupled, not just by the PCB, but by the "heat spreaders" often sold with the drive or built into the motherboard. Google around and you'll see lots of thermal camera images of M.2 drives.

As the article points out, these drives consume up to ~10 W under load. That's actually a lot of power for something with very, very little thermal mass - around 10 grams, and a heat capacity of around 400 J/(kg·°C) is common for PCBs and chips. At 0.4 J/(g·°C), just one second at full load, if the heat is generated evenly across the entire device, heats it up by 2.5 °C. Assuming no cooling, that's about 24 seconds until it hits its thermal throttling point.
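
A quick back-of-the-envelope sketch of that arithmetic; the 10 g mass, 0.4 J/(g·°C) heat capacity and 80 °C throttle point are the rough figures above, and the 20 °C ambient is an assumption, not a measured value:

    # Adiabatic heating of an M.2 SSD at full load (rough figures, not measurements).
    power_w = 10.0         # sustained draw under load (W = J/s)
    mass_g = 10.0          # PCB + packages
    c_j_per_g_c = 0.4      # specific heat capacity, J/(g*degC)
    ambient_c = 20.0       # assumed starting temperature
    throttle_c = 80.0      # roughly where throttling kicks in, per the article

    rise_c_per_s = power_w / (mass_g * c_j_per_g_c)            # 2.5 degC per second
    secs_to_throttle = (throttle_c - ambient_c) / rise_c_per_s

    print(f"{rise_c_per_s} C/s, ~{secs_to_throttle:.0f} s to throttle with no cooling")
    # -> 2.5 C/s, ~24 s to throttle with no cooling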

From the article:

> The amount of activity taking place on the gumstick-sized M.2 form factor means higher temps not only for the storage controller, but for the NAND flash itself.

> NAND, Tanguy explains, is happiest within a relatively narrow temperature band. "NAND flash actually likes to be 'hot' in that 60° to 70° [Celsius] range in order to program a cell because when it's that hot, those electrons can move a little bit easier," he explained.

> Go a little too hot — say 80°C — and things become problematic, however. At these temps, you risk the SSD's built-in safety mechanisms forcibly powering down the system to prevent damage. However, before this happens users are likely to see the performance of their drives plummet, as the SSD's controller throttles itself to prevent data loss.

FYI, according to his LinkedIn, Tanguy is the principal product engineer at Micron.


I'd ask: could we not use the more powerful, well-cooled CPU we already have in our computers, instead of pushing SSD controller complexity and power ever further? What if we used something like UBIFS or F2FS and removed/simplified the FTL?


Speaking from memory and a coarse-grained understanding:

1) MLC/TLC/QLC work more like 4/8/16-tone grayscale e-paper than like simple flash: e.g. storing 0b1010 means parking the cell at one of 16 threshold-voltage levels, that's "4 bits per cell". And it's not a single pulse of the target voltage into a memory cell; it's more like repeated small pulses, each followed by a verify, until the cell lands just inside the target level. Readout is probably more complicated still, let alone lifecycle management. Those businesses might be more involved than they're worth a filesystem researcher's time (see the toy sketch after this list).

2) It was often said, at least years ago, that a considerable fraction of the heat in NVMe SSDs comes from PCIe serialization/deserialization (SerDes), rather than from payload data processing or NAND programming.

If both of the above are true, maybe it's PCIe that should be replaced, with something more like the original PCI?
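
To make point 1 concrete, here's a toy sketch of the bits-per-cell vs. voltage-levels idea and the pulse-and-verify loop; the voltage window and step size are invented for illustration, not real NAND parameters:

    # Toy QLC model: 4 bits -> one of 16 threshold-voltage levels.
    # The voltage window and pulse size are invented, not real NAND figures.
    LEVELS = 16
    V_MIN, V_MAX = 0.0, 6.0                  # pretend threshold-voltage window (V)
    STEP = (V_MAX - V_MIN) / LEVELS

    def target_voltage(bits4: int) -> float:
        """Map a 4-bit value (0..15) to the centre of its voltage level."""
        assert 0 <= bits4 < LEVELS
        return V_MIN + (bits4 + 0.5) * STEP

    def program_cell(bits4: int, pulse_v: float = 0.05) -> float:
        """Crude incremental-step programming: pulse, verify, repeat."""
        vth = 0.0                            # start from an erased cell
        while vth < target_voltage(bits4) - pulse_v:
            vth += pulse_v                   # each pulse nudges Vth up a little
        return vth

    print(round(program_cell(0b1010), 2))    # ~3.9 V for level 10 in this toy model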


The original PCI was parallel. You can't have an excessively fast parallel bus, because the tiniest differences between lanes make different pins receive the signal out of sync. This is why the RAM interface is so hard to get right, and why the traces there are kept as short as possible.


Once you get to this level of performance, the bulk of an SSD controller ASIC is essentially just a high-performance switching fabric: directing the flow of data between the 8-12 channels of NAND flash, the PCIe bus, its internal buffers and the DRAM.

If you have any experience with high-performance networking equipment, you know that pure switching-fabric ASICs generate a lot of heat on their own. Hell, even a dumb 5-port gigabit Ethernet switch generates a surprising amount of heat; they are always warm to the touch.

I really doubt that handling the FTL layer on the controller adds that much extra power draw. A dumb PCIe <-> NAND switching ASIC will also have cooling problems.


I recently upgraded my home network to 10 gigabit and was surprised by the amount of heat generated by 10 gig switches. Why do pure switching fabric ASICs generate so much heat?


Gates consume electricity when they switch: some energy is needed to flip a FET from "open" to "closed" or back. Then the gate stays in a particular state for some time, allowing the circuit to operate.

The faster you switch a gate, the more often you have to pay that switching price, which cannot drop too low, or thermal noise would overwhelm the signal. So you spend roughly 10x the energy switching a 10 Gbps stream as you do for a 1 Gbps stream. Newer, smaller gates consume less energy per switch, but not 10x less.
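
As a rough sketch of that scaling, using the standard dynamic-power relation P ≈ α·C·V²·f; the capacitance and voltage below are illustrative guesses, not figures for any real switch ASIC:

    # Dynamic switching power: P = alpha * C * V^2 * f.
    # C and V are illustrative guesses, not real switch-ASIC figures.
    def dynamic_power_w(c_farads, v_volts, f_hz, alpha=1.0):
        return alpha * c_farads * v_volts ** 2 * f_hz

    c_path = 20e-15    # ~20 fF of switched capacitance per datapath bit (guess)
    v_dd = 0.9         # core voltage (guess)

    p_1g = dynamic_power_w(c_path, v_dd, 1e9)      # toggling at ~1 Gbps
    p_10g = dynamic_power_w(c_path, v_dd, 10e9)    # toggling at ~10 Gbps

    print(round(p_10g / p_1g, 2))  # -> 10.0: same gates, 10x the toggle rate, ~10x the power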


> Why do pure switching fabric ASICs generate so much heat?

There is also the fact that these are consumer devices and margins have to be high, so the quality of the product is tailored accordingly.


SSD controllers are just ASICs. I'm not sure why you want to go from a chip tailored for a specific task to a general-purpose one that already has way too much on its plate. Then there are things like latency and how the controller abstracts away the details of how an SSD works internally. All that complexity doesn't go away by putting it on your CPU. You're just moving it from one place to another for no benefit, and adding other complexities.

Could ask the same thing about all the extra silicon in GPUs that adds hardware acceleration for video encoding/decoding.


The point of doing it in software on the host kernel would be to allow the flash layer and the filesystem to co-evolve, instead of being agnostic at best and antagonistic at worst.


Who's going to write the code for it? Microsoft? Does every manufacturer write their own kernel-level driver? What happens to Linux/Unix? I don't want any of these manufacturers anywhere near the kernel, or even doing any more in software than they already do. Samsung isn't exactly known for code quality.

This is a fantasy whose benefits are questionable at best and don't outweigh the downsides.


> Microsoft?

Yes. That seems ideal to me. Microsoft, Apple, open source contributors. Today what you have is a closed-source translation layer written by the kinds of people who write PC BIOSes, i.e. the biggest idiots in the software industry. I would be much happier with an OS vendor flash storage stack. For all I know, I am already using something like that from Apple. And I assure you that large-scale server builders like Amazon and Google are already doing it this way.


Calling the people in those layers idiots is doing them a favor, excusing the de facto practices as mere incompetence. The truth includes another layer: the business of business, who pays, who gets to do what. Booting your own hardware is the real subject, and the actors there are not the ones that come to mind when you think of consumer advocacy.

The largest companies have other alignments that are not often discussed openly.


> don't want any of these manufacturers anywhere near the kernel, or even doing any more in software than they already do. Samsung isn't exactly known for code quality.

That seems like such a bizarre take. You think it's better that the crappy code is given to you as black-box firmware with no oversight, rather than in the open, written to kernel quality standards, where it can at least hypothetically be improved?


Although, with flash memory cells nearing their physical limits in lithography, pretty soon you'll need active cooling for bigger stacks.


I doubt one is far from the other.

Shoveling IOPS into a bus is an easily parallelizable problem, while NAND-flash memory has a very high theoretical floor on its capacitance. Any good engineer would optimize the CPU part up to the point where it's only a bit worse than the flash, and stop there, because there isn't much gain in going further.

If that's the case, you will see the CPU being the bottleneck on your device, but it's actually the memory that constrains the design.

That is, unless the CPU comes from some off-the-shelf design that can't be changed due to volume constraints. But I don't think SSDs have that kind of low volume.


> That is, unless the CPU comes from some off-the-shelf design that can't be changed due to volume constraints.

Most SSDs (with exceptions like Samsung's) simply use Silicon Motion's IP (https://www.siliconmotion.com/products/client/detail) for their controllers.

> But I don't think SSDs have that kind of low volume.

If a custom design adds a cent or two to the BOM then that doesn't matter, but when you need to verify that the changes work as intended and that data isn't corrupted (beyond specifications), that's a lot of cents to be saved. Plus, Silicon Motion can have TSMC fabricate it at a lower cost per unit (because there is only one pattern to manufacture) than customising the controllers for each drive.


You're vastly overestimating Silicon Motion's market share. Samsung, Micron, Western Digital, SK Hynix (+Intel), and Kioxia all use in-house SSD controller designs for at least some of their product lines. Among second-tier SSD brands that don't have in-house chip design or fabrication, Phison is dominant for high-performance consumer SSDs.

Speaking about SSD controllers in general: they do use off-the-shelf ARM CPU core designs (e.g. the Cortex-R series), but those are usually the least important IP blocks in the chip. The ARM CPU cores mostly handle the control plane rather than the data plane, and the latter is what is performance-critical and power-hungry when pushing many GB/s.


One of the issues with M.2 in desktop PCs is how buried from the airflow they are, and often they're literally on the exhaust side of the GPU (many GPUs exhaust on both long edges, and many motherboards just so happen to have an M.2 slot under the PEG) or in the air-flow dead-zone between PEG and CPU cooler.

Overall the AT(X) form factor, with extension cards slotting in at a 90° angle, just doesn't work all that well for efficient heat removal. DHE takes away I/O slot space and requires high static pressures (so high fan RPMs), it works for headless servers, but that's about it. The old-fashioned way of a backplane and orthogonal airflow does work much better for stuff like this; but it also requires a card cage and is not very flexible in terms of card dimensions. The one saving grace of ATX is that cards and their cooling solutions can grow in length and height, GPUs are much taller than a normal full-height card, and many are much longer than a full-length card is supposed to be as well.


> ”Overall the AT(X) form factor, with extension cards slotting in at a 90° angle, just doesn't work all that well for efficient heat removal.”

To give an example of this, here’s a server from a huge cloud provider for a brand new AMD 7700 on an ATX board.

Those 90° angles make for horrible airflow.

https://twitter.com/PetrCZE01/status/1637122488025923585


That looks more like "you put the power supply on the wrong side".

But yeah, most servers have risers that flip the cards to be parallel to the board.


It should be clear that this was only shot for marketing purposes. I don't think they actually run cables like that, but it probably looked better to have cables visible in the picture.


Do you know this for a fact?

Or are you speculating.


That looks like a ribbon cable blocking the fan?


Yeah, but X470D4U and similar boards are so overpriced one can somewhat relate to people using gaming boards for servers. Especially since a lot of them route ECC pins nowadays.

I sure wasn't happy paying extra just to have a different board layout with mostly the same components.

Well, there's IPMI at least. Still not worth the price tag.


People are buying gaming-branded PEG propping sticks and sustainer wires because high-end PEGs are sagging, and neither the case nor the card supports that front slot for full-length cards. It's well past time for a card-cage spec, as far as I can see from the user perspective.


This isn't really an issue, because everyone's M.2 is working fine. You have to construct absurd scenarios to cause problems: use a case with bad airflow, a hot GPU, and a workload that pushes the GPU and M.2 and CPU to their limits indefinitely, which isn't a real-life thing.

And if it IS a real-life thing because you have some special use case, you use a case with good airflow.


I think this depends on what the definition of "fine" and "problems" is. IMO most people don't even notice when their drive throttles due to thermals, so it's probably fine that the drives get hot. At the same time, as newer drives keep drawing more and more power, this is going to start pushing into "well, why did I buy the fast drive in the first place?" territory if they don't come with these ever-increasing cooling solutions as well.


I'm water cooling both the CPU and GPU in my PC and have found that this leads to virtually no airflow over the M.2 slot. For now I've simply placed a fan on top of the GPU, aimed directly at the slot, and that keeps the SSD at 50 to 60°C. I am considering installing a water block on the SSD when I next do maintenance.


All of it doesn't matter because SSDs like being hot.


Sure, if you don't like your data.


In the same way that an ice cream is too cold and could use some heat.


There were early PCIe 5.0 SSD samples pushing closer to the theoretical max of 16 GB/s, but they were consuming up to 25 W. The current PCIe 5.0 drives only do about ~11 GB/s, but stay within a 12 W power envelope.

I do wonder if we have hit the law of diminishing returns. With games optimised for the PS5's storage system and Xbox's DirectStorage, developers are already showing that 80-95% of load time is spent on the CPU.


As soon as SSDs are faster, developers will find ways to waste more space and do more I/O operations...


Maybe we should revive U.2 on consumer desktops. The vast majority of desktops will be fine with only a single M.2 drive, but back in the day the vast majority of desktops were fine with a single SATA drive and yet motherboards commonly came with 8 SATA ports.


What's the issue with a jump from 12w to 25w?


We're talking about an SSD form factor that's 22x80mm and is fed by a couple of card edge pins carrying 3.3V. 12W was already pushing it.


According to the one-page datasheet of a Foxconn M.2 M-key socket, the maximum current per pin is 0.5 A (they're tiny, after all). Since M.2 M-key has a total of nine pins carrying 3.3 V, this would limit power to about 15 W, before any heat-dissipation considerations plus connector derating, because the toasty SSD is heating the connector up.
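
A quick sketch of that arithmetic; the 0.5 A per pin and nine 3.3 V pins are the datasheet figures quoted above, while the derating factor is an invented placeholder:

    # M.2 M-key power budget from the connector figures quoted above.
    pins_3v3 = 9            # 3.3 V supply pins on the M.2 M-key edge connector
    amps_per_pin = 0.5      # per-pin limit from the Foxconn datasheet
    v_supply = 3.3

    raw_limit_w = pins_3v3 * amps_per_pin * v_supply
    print(round(raw_limit_w, 2))              # -> 14.85 W before any thermal derating

    derating = 0.8          # placeholder: connectors get derated when they run hot
    print(round(raw_limit_w * derating, 1))   # -> ~11.9 W with that invented factor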


...on a device that weighs about 10 grams with a heat capacity likely around 0.4 J/(g·°C).

10 W into such a device, if I did the math right, is around a 2.5 °C/sec rise in device temperature.


The M.2 sockets on motherboards don't exactly have great cooling, and SSDs don't usually even come with a heatsink.

The positioning can also be pretty iffy; mine has one slot next to the CPU and another just under the GPU (no chance of getting a fan there), and those are the "fast" (directly connected to the CPU) ones!

The other two slots are again under the GPU (one filled with a WiFi/BT card), and only the last two are far away from other hot components and get their own heatsink, but those are not directly connected to the CPU.


I don't think this would be a factor. Some PCIe 4.0 SSDs already come with metal heatsinks out of the box. If future SSDs need additional cooling, this will be communicated to the buyer.

I think the bigger question is whether 25 W can be physically supplied to the drives by contemporary motherboards. What is the power limit for the M.2 ports?


At least according to Wikipedia, each pin is rated up to 0.5 A, with (I think) nine 3.3 V pins, so technically just around ~15 W peak.

Technically that's what U.2 (the 2.5-inch form factor for SSDs) would be for.

It gets 5 V/12 V and a thicker connector. I severely doubt M.2 could swing 25 W, as it only has 3.3 V on it.


For those who don't read the article:

> "NAND flash actually likes to be 'hot' in that 60° to 70° [celcius] range in order to program a cell because when it's that hot, those electrons can move a little bit easier," he explained. ... Go a little too hot — say 80°C — and things become problematic


Sure... erasure energy is lower when it's hot... But there are lots of other downsides, like a much reduced endurance, and more noise in sense amplifiers meaning there is a higher chance of needing to repeat read operations.


Design priorities probably get warped when doing well at artificial benchmarks/torture tests in reviews comes to the fore.


Absolutely. CrystalDiskMark is bad for the consumer SSD market.


Unfortunately the alternative is trusting the manufacturer's data, which leads to cheating.


The choices aren't exactly between a bad benchmark and no benchmark at all. And the widespread use of CrystalDiskMark as a de facto standard by both independent testers and drive vendors has done nothing to slow the rise of behavior that an informed consumer would consider to be cheating.


Does anyone know of a PCIe 5.0 SSD designed for sustained read/write? Most of these new drives are meant for short bursts of data transfer; I can't imagine water cooling would be necessary for most drives.



I don't understand why pushing 16 GB/s requires so much power. A fully custom IC where the data path is in silicon should be able to handle that speed, no sweat.


SSD controllers aren't just moving a lot of data. Between the PCIe PHY and the ONFI PHY there's a lot of other functionality. In particular, doing LDPC decoding at 16GB/s (128Gb/s) is not trivial.
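
To put a rough number on the decoding load: at 16 GB/s of user data, and assuming a ~4 KB codeword with ~10% parity (illustrative figures, not a specific controller's parameters), the controller has to decode millions of codewords per second:

    # ECC decode rate needed to sustain full line rate (illustrative assumptions).
    user_bps = 16e9 * 8            # 16 GB/s of user data = 128 Gbit/s
    codeword_user_bits = 4096 * 8  # assume ~4 KB of user data per LDPC codeword
    code_rate = 0.9                # assume ~10% parity overhead

    codewords_per_s = user_bps / codeword_user_bits
    raw_nand_gbps = user_bps / code_rate / 1e9

    print(f"{codewords_per_s:,.0f} codewords/s")      # ~3.9 million every second
    print(f"{raw_nand_gbps:.0f} Gbit/s raw NAND traffic incl. parity")   # ~142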


ECC/crypto is pretty energy-intensive; other bookkeeping like wear leveling and read/write-disturb handling is also quite complicated.


16 GB/s over a serial interface with 4 lanes is 32 Gbit/s per lane.


> While NAND flash tends to prefer higher temperatures there is nothing wrong with running it closer to ambient temperatures

New one on me :) I did not know NAND liked to be hot; if true, that does not bode well for laptops or for over-clockers.

To me, the end result seems to be: yes and no, up to you. But I still prefer HDDs anyway; I am very old school.


You do not actually prefer HDD, and nothing bodes badly for overclockers or laptops. You are just looking for ways to be contrarian.


What's wrong with HDD? It's actually quite convenient having time for your morning jog and a shower while you wait for your computer to boot up.


Nothing at all, I like to be able to count my IOPS on my fingers.


As a bonus, since it's really old, it sounds like an old gravity-fed drip coffee machine.

On a heat note: a spinning rust disk also uses quite a few watts, every time, all the time. Power-wise, the high-wattage SSDs are still less power-hungry over time.
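
A rough sketch of that "over time" comparison; the power figures and the 5% duty cycle are assumptions for illustration, not measurements of any particular drive:

    # Daily energy: always-spinning HDD vs. bursty NVMe SSD (assumed figures).
    hours = 24
    active_fraction = 0.05                  # desktop storage is busy ~5% of the time

    hdd_active_w, hdd_idle_w = 8.0, 5.0     # a 3.5" disk keeps spinning while idle
    ssd_active_w, ssd_idle_w = 10.0, 0.05   # NVMe bursts high, then drops to ASPM idle

    def wh_per_day(active_w, idle_w):
        return (active_w * active_fraction + idle_w * (1 - active_fraction)) * hours

    print(f"HDD: {wh_per_day(hdd_active_w, hdd_idle_w):.0f} Wh/day")   # ~124 Wh
    print(f"SSD: {wh_per_day(ssd_active_w, ssd_idle_w):.0f} Wh/day")   # ~13 Wh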


A 1 TB 2" HDD I attached to an Odroid consumes little more than 1 W. A 3.5" 2 TB one consumes 10 W. I turn them off by software when I don't need it. They are a backup storage.


Preferring a storage medium for its reliability, regardless of the number of writes it endures, is utility - I could see that being preferable to an SSD in some specific cases. Maybe there are other upsides, e.g. it's often much cheaper.

Regardless, some people drive an old, dangerous, slow, gas-guzzling car - and maintain it at great expense - just because they prefer it. Aesthetic and sentimental appeal is highly personal and knows no bounds.


Except hard drives aren't celebrated for reliability. Or speed. Or low latency. Or durability (try knocking one). Or power and heat. Old cars you can at least make some arguments for... Hard drives as primary storage/boot, no.

Really they're good for bulk storage. And that's it. For use in primary compute they're really great if you want to slow everything down.


AFAIK compared with SSDs they are better (reliability).

And running any electronic component hot is just asking for trouble.


> AFAIK compared with SSDs they are better (reliability).

Depends on the price point.

Just days ago a PM1725 gave us trouble. Yet five WD10JUCTs I bought recently (in RAID 5) beat it on price and available capacity, even with abysmal performance.

> any electronic component hot is just asking for trouble

I'd say running too hot.


> hard drives aren't celebrated for reliability.

With no revelry whatsoever, my 2006 early-SATA Maxtor 100 GB HDD is still going strong with Windows 11 on a Vista-era Dell PC.

It boots no slower than the 2-year-old SSD W10 PCs our IT guys have at the office.


I love my HDDs. I just invested in 90 TB. With SSDs, that would have cost me a little more than two and a half times as much and been six times as many drives. I do not have the SATA ports or the power connectors for that.


Please tell us more about the psychology of the parent commenter.

You seem to have studied it quite well, or perhaps find that ad-hominems make for the best arguments!


NAND is bad for cold storage, because writes cause more wear when it's not warm. Meanwhile data retention benefits from lower temperatures.


Cold storage by its nature doesn't have a lot of writes...


And if there were a hot (as in frequently used) drive, it would still heat up (as in temperature).

But I guess the other commenter's point might be valid if you run a datacenter in a blast chiller.



