Well, space, air conditioning, electricity, and redundancy comparable to what Amazon offers would add a few thousand dollars per month (I'd guess $2000-$5000, depending on where you are).
Still, way cheaper than Amazon, but there are nontrivial running costs beyond the purchase and build price.
No. If you rely on a lot of storage it pays to have really smart people building the hardware and software, and maintaining it, with service agreements in place.
If 100 TB was as easy as buying a bunch of discs and slapping NASlite or FreeNAS on them then everyone would be doing it.
Because nineteen grand a month buys you a boatload of hardware /and/ a really smart person. A really smart person who can also be put to work on other things.
S3, economically speaking, is an incredible deal if you are only using a little. But if you are using a lot? it makes NetApp or EMC start to look good, which is to say, only a good deal if money doesn't matter.
I mean, yeah, if you are in a situation where money doesn't matter, or rather a situation where the amount of money you are spending on storage is trivial compared to what you are doing with the storage, sure, NetApp or EMC will deliver a far superior product. S3 may have its uses too. But if you are doing something where the storage needs to be cheap? Then neither of those players nor S3 will get you where you want to go.
I've been following the online storage space for years. Every time something like this comes out I redo my back-of-the-napkin math on potential profitability, get excited, then find excuses to not move into the space.
Prices have gotten so low that there's no excuse for the lack of quality online storage providers. There are many, but there are only a handful I would consider good, and their pricing isn't following the market. I see this leaving a huge opening for competition.
If anyone else is seriously interested in online storage contact me (see profile).
eh, I'm working on it right now; I'll be bringing the prgmr.com values, which is to say the focus is on cheap, standard, and transparent; uptime goals will be... realistic. I'm doing it with my co-author, and I have at least one investor who is interested, though I'm not sure we'll need him... even at our target of a penny per month per replication, it's going to be ridiculously profitable, probably enough that we won't need money. What are you bringing to the table?
I've been thinking for a while about how you could make an S3-like service by loading up 36 drive boxes and shipping them off to lots of different cheap colo providers throughout the country, duplicating storage across them and managing the whole bit on a reliable provider like AWS.
Am I missing a zero? That's about $0.10 per gigabyte, no? Which is dang close to what Amazon charges. Unless you mean one hundred grand up front rather than one hundred grand a month.
hm. I think the problem is mostly the profit margin. Cheap 3TB drives are what, $130 each? And you need 333 or so of those to get 1000TB. If you go with, say, 12-disk raidz2 sets, that'd be around 400 disks. Of course, these are 'seagate petabytes' I'm counting, but that's only $52,000 for the disks; then add, say, 10 of the supermicro chassis.
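Just to put numbers on that back-of-the-napkin estimate, here's a quick Python sketch (the chassis price is my assumption, borrowed from the ~$1500 Supermicro figure further down the thread; everything here is a guess, not a quote):

    import math

    DRIVE_TB = 3
    DRIVE_PRICE = 130          # cheap 3TB drive, per the comment above
    CHASSIS_PRICE = 1500       # assumed 45-bay Supermicro chassis, sans drives
    BAYS_PER_CHASSIS = 45

    target_raw_tb = 1000                                   # one "seagate petabyte"
    data_drives = math.ceil(target_raw_tb / DRIVE_TB)      # ~334 drives of usable space
    total_drives = math.ceil(data_drives * 12 / 10)        # 12-disk raidz2: 10 data + 2 parity
    chassis = math.ceil(total_drives / BAYS_PER_CHASSIS)   # ~9; call it 10 with a spare

    print(f"{total_drives} drives  -> ${total_drives * DRIVE_PRICE:,}")
    print(f"{chassis} chassis -> ${chassis * CHASSIS_PRICE:,}")
    print(f"total hardware guess: ${total_drives * DRIVE_PRICE + chassis * CHASSIS_PRICE:,}")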
I think you could do it, I mean, at scale where labor costs are mostly amortized out, if you were willing to accept really low margins, or if someone would design and test it for free.
But that's the thing, someone like me? even a cut-rate dedicated server/VPS provider is used to charging around 1/4th to 1/6th the cost of the hardware /every month/ - obviously, this results in margins that are pretty nice by the standards of companies that sell goods with significant marginal costs, at least if you have enough scale that you don't blow it all on labor.
Really, the "big fee up front" model is interesting and warrants a discussion all its own.
There may be a reason to build 20 backblaze pods, but I doubt there is a reason to build 1. The whole point of a backblaze build is to move the redundancy out of the individual box.
By itself this device is poorly suited to backup, especially for their use case. A huge pile of desktop Hitachi disks isn't something you can stick in a rack and forget about, and people like to forget about backups.
Best case: you have someone who's constantly checking/maintaining this box and replacing the drives as they fail.
Worst case: an expectation of a backup, but every time you refer to it, it doesn't work.
Most likely case: a mix of both of the above, where the box takes a lot of attention and occasionally works as intended.
The only redeeming factor is that they proclaim all the reasons why this is a poor choice. So it seems someone is at least thinking about it.
I take it you only read the first part of the article.
They used RAID6 over three groups of 15 drives each, combining those groups via LVM, which allows the loss of 2 drives from the same group without data loss. This is the same configuration Backblaze uses, because even though their fancy custom software mirrors all data to different storage pods, it is easier to just replace a drive and rebuild the RAID than to rebuild the whole JBOD from the data on other pods.
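For reference, the capacity math on that layout (a quick sketch with the 3TB drives discussed elsewhere in the thread, not figures from the article):

    DRIVE_TB = 3
    GROUPS = 3
    DRIVES_PER_GROUP = 15
    PARITY_PER_GROUP = 2   # RAID6 tolerates two failed drives per group

    raw_tb = GROUPS * DRIVES_PER_GROUP * DRIVE_TB
    usable_tb = GROUPS * (DRIVES_PER_GROUP - PARITY_PER_GROUP) * DRIVE_TB

    print(raw_tb, "TB raw")        # 135 TB across 45 drives
    print(usable_tb, "TB usable")  # 117 TB before filesystem overhead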
Your point isn't entirely invalid, but I lose HDDs a hell of a lot more often than I encounter any of the other problems you mention. I don't think I've seen a power supply pop since 2006, I've never lost UPS/surge-protected gear to a mains spike, every server room or datacenter I've cared about has been environmentally monitored, heptafluoropropane exists for a reason, and why are you building a datacenter on a floodplain?
In contrast, I lose 5-10% of the spinning HDDs I care about every year.
> why are you building a datacenter on a floodplain?
Every spot in our solar system is susceptible to some unrecoverable disaster. If you don't want to lose your data, you have to have copies in more than one physical location.
Seriously, I didn't say he was wrong in theory, but focusing so much attention on the highly unlikely to the detriment of the exceedingly common is a drastic distortion of the issue, and a cost/benefit analysis skewed in either direction can have costly repercussions.
No, I've not dealt with enough of them for long enough to have anything meaningful to say on the subject. So far I've not even had one clearly die (though there have been some very odd transient incidents.)
Jeff Atwood made an interesting post on that very subject a while back, though:
It seems like an interesting one-off kind of scenario. They have a lot of downloaded data, none of which is irreplaceable, but which would take a significant amount of time to redownload. This pod is a backup, but not a very rigorous one, since the goal is to never have to redownload all of the data, not to never have to redownload any of it.
I'm looking at enclosures that can hold 60 disks, so with 3TB SAS disks that would be 180TB. Most enclosures of that kind have dual everything. Some SAS drives even have dual ports, I believe (I should look that up again).
At some point it's really a function of the size of the drives that makes the volume large. I think 60 drives is going to pretty much be the physical limit in 4 rack units.
You can buy the NL-108 from Isilon which has 108TB (36 3TB disks) if you want "enterprise." ;) Eventually they'll use 4TB drives in those.
"The backblaze 2.0 pod has exceeded expectations when it comes to data movement and throughput. We get near wire-speed performance across a single Gigabit Ethernet link." - http://bioteam.net/2011/08/backblaze-performance/
Oh my. These guys' expectations were very, very wrong. Many single-drive configurations can sustain 120MB/s+ at the beginning of the platter. A single drive! It should be no surprise that a 45-drive beast can saturate a GbE link!
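To put rough numbers on why that expectation was off (my figures, not from the linked post):

    GBE_MB_PER_SEC = 1e9 / 8 / 1e6        # ~125 MB/s per Gigabit link, before overhead
    SINGLE_DRIVE_MB_PER_SEC = 120         # outer tracks of a 2011-era 7200rpm drive
    DRIVES = 45

    print(GBE_MB_PER_SEC)                     # 125.0
    print(DRIVES * SINGLE_DRIVE_MB_PER_SEC)   # 5400 MB/s of raw platter bandwidth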
It is too bad they didn't bond the two ethernet links. Reading through I didn't see a mention of what the other ethernet link was used for. That should have given them a jump in speed.
We're considering building one of these, but I used to work for a clustered storage vendor. The important thing to notice here is that this is single-client performance - aggregate performance is often considerably better. I.e., you might be able to push 1GB/s with 10 clients, but only 100MB/s with one...
Well... You can make all 46 drives into a zpool with raid-z and boot FreeBSD or *Solaris from it.
BTW, I can't see the vibration sleeves around the disks in the pictures. Vibration must be a problem when you pack that much rotating media into such a limited space.
Backblaze employee here. Yes, the 6 fans move a lot of air through the drives. We monitor every drive in every pod in our datacenter with smartmontools (/usr/sbin/smartctl), which includes the temperature of the drives. The drives stay well within their recommended operating temperatures as long as the fans are running. In the past, we have detected several failed fans in a single pod because alarms went off saying some drives were running hot, so (normally) the fans are working and the air is flowing. Side note: personally I think it's really awesome that consumer grade drives come with such detailed internal monitoring for free. I also think it is disappointing that desktop consumer OSes don't monitor these things BY DEFAULT and pop up dialogs warning you when your drive is running hot or showing signs of data loss!
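For anyone curious, this is roughly the kind of check being described - a minimal Python sketch around smartctl, not Backblaze's actual monitoring code; the device list and the 45°C threshold are my assumptions:

    import subprocess

    DEVICES = [f"/dev/sd{c}" for c in "abcdefgh"]   # adjust to your own drives
    WARN_AT_CELSIUS = 45                            # assumed alert threshold

    def drive_temperature(dev):
        """Return the Temperature_Celsius raw value from `smartctl -A`, or None."""
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "Temperature_Celsius" in line:
                fields = line.split()
                # raw value is the 10th column, e.g. "... Always  -  33 (Min/Max 21/43)"
                if len(fields) >= 10 and fields[9].isdigit():
                    return int(fields[9])
        return None

    for dev in DEVICES:
        temp = drive_temperature(dev)
        if temp is None:
            print(f"{dev}: no temperature attribute reported")
        elif temp >= WARN_AT_CELSIUS:
            print(f"{dev}: {temp} C -- check the fans!")
        else:
            print(f"{dev}: {temp} C")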
A Google study about hard drive failure indicated that they couldn't find any conclusive relationship between temperature and failure rate. However, they did mention it had some effect at the extreme high end of the temperature spectrum (based on the graph, above 45 °C or so).
It doesn't take a huge amount of work to drive a disk over 45 degrees. Using 5K-rpm drives helps a lot, since many 7K-rpm drives will idle at 35 degrees in a room-temperature environment with passive airflow.
I've seen actively cooled 7K-rpm drives in my ZFS NAS start giving checksum errors while resilvering (i.e. recalculating parity etc.) after replacing a bad disk; smartmontools reported temperatures of about 60°C, IIRC. Using an LSI 8-port SATA controller, I was resilvering a 4-disk raidz at a rate of perhaps 300MB/sec, and it was making the drives too hot. I had to fall back to the motherboard SATA connectors (and ~150MB/sec array throughput) to keep things cool enough to complete.
If your active-cooled drives are getting that hot, then something is very wrong with either your ambient room temperature (is it 40°C?) or with your active cooling.
Active cooling in a reasonable-temperature room keeps even 15K drives well under 40°C at full duty (closer to 30°C, really, according to the monitoring data I just looked at; the room is around 20°C). Keeping drives within 15°C of ambient shouldn't be a huge deal (though it can be noisy).
It's only 400W in 4U (see my other comment in the thread). 100W/U is tiny, easy to cool. That's about two 120V-20A circuits per cabinet, which is very common.
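A quick sanity check on that, assuming a 44U cabinet of pods and the usual 80% continuous-load derating:

    WATTS_PER_U = 100
    CABINET_U = 44
    CIRCUIT_WATTS = 120 * 20 * 0.8   # ~1920 W usable per 20A/120V circuit

    cabinet_watts = WATTS_PER_U * CABINET_U       # 4400 W
    circuits = cabinet_watts / CIRCUIT_WATTS      # ~2.3, i.e. "about two" circuits
    print(f"{cabinet_watts} W -> about {circuits:.1f} circuits per cabinet")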
What do you guys think about using this with Swift from OpenStack? Could I buy 100 of these pods, drop OpenStack on them, and get close to 13,500 terabytes with some level of redundancy?
Some of the hardware choices that Backblaze has made in their pod design are interesting. Most of the concerns I have (like the difficulty of getting to drives to replace them) can be addressed by operational practices (update the swift ring rather than immediately replace the drive). Other concerns (like 2gbe for 135TB) are more nuanced. Lack of redundancy in the Backblaze pods is addressed by swift itself--swift will ensure that no two copies are on the same pod.
I would love to see someone run swift on some Backblaze pods. If you'd like to talk further, my contact info is in my profile (or drop by #openstack on freenode).
Very low. I estimate the power consumption at the wall to be ~400W. That's $29/month at $0.10/kWh. Host this in a datacenter with an unremarkable PUE of 1.5, and that's $43.50/month including cooling.
* 100 Watt for the mobo, CPU (73W TDP), SATA controllers, and all the fans. Source: my own clamp-meter measurements on dozens of PCs http://blog.zorinaq.com/?e=42
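Putting that estimate together (a quick sketch using the numbers above, with 730 hours per month):

    WATTS = 400
    PRICE_PER_KWH = 0.10
    PUE = 1.5
    HOURS_PER_MONTH = 730

    kwh = WATTS / 1000 * HOURS_PER_MONTH       # ~292 kWh/month
    server_only = kwh * PRICE_PER_KWH          # ~$29/month at the wall
    with_cooling = server_only * PUE           # ~$44/month including cooling

    print(f"${server_only:.2f} server, ${with_cooling:.2f} with cooling")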
Wow. The cost of 1 Petabyte of spinny disk storage is $100k.
A friend of mine who raised money in 1999 said that to get their MVP off the ground, it took 20 engineers, and $7M in venture funding. I doubt $100k would have paid for the Oracle licenses back then.
Given that an empty pod costs $5,395, Supermicro might actually be cheaper (although a little less dense) and it's a real product instead of a prototype.
for my mass storage project. they are both well under $1500 with power supplies, backplanes and expanders taken care of.
For me, the big thing is that with my storage model, I'm going to be replacing disks as they fail and rebuilding the RAID, so having easily accessible and easily swapped disks is worth paying a premium for. (I am planning to have some cross-chassis redundancy using zfs snapshots, but I'd rather just keep the nodes going as-is.)
Also, rack density? Doesn't save you that much money. Most of what you are paying for in a data center is power. At the cheapest co-lo I'm in, here is the cost breakdown:
1 full cabinet (44u) with two twenty-amp 120v circuits: $875
1 full cabinet (44u) with one twenty-amp 120v circuit: $530
So, if I can double my density, I save $185 a month; and even at the disk density of my compute nodes (close to 100 disks in a rack) I get one disk failure maybe every two months per rack. So if I have to slide out the whole goddam computer, causing some chance of the power getting disconnected and downtime? Yeah, with my model, it's probably worth paying the premium.
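For clarity, the $185 figure is just the one-cabinet-with-two-circuits price compared against two cabinets with one circuit each:

    DOUBLE_DENSITY_CABINET = 875   # 44U, two 20A/120V circuits
    SINGLE_DENSITY_CABINET = 530   # 44U, one 20A/120V circuit

    spread_out = 2 * SINGLE_DENSITY_CABINET   # $1060/month across two cabinets
    packed_in = DOUBLE_DENSITY_CABINET        # $875/month in one cabinet
    print(f"monthly savings from doubling density: ${spread_out - packed_in}")   # $185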
I'm just saying, if you are small enough that paying five grand for an empty backblaze pod chassis is a meaningful expense, this trade-off matters.
Sure, at scale, it's best to design your systems so that you get zero downtime even when hardware fails. But that's really difficult to do without introducing new failure modes; even Amazon has trouble with it. My strategy is to accept that hardware failures mean a truck roll and downtime for the customers on the hardware in question. As long as you don't have any one server go down more than once a year (and a particular server failing once a year is pretty pathetic), and as long as you don't have a system where any one server brings down everyone, you are going to see pretty good reliability with this strategy.
I think there are a lot of people, even institutions like where I work, that are looking hard at storage that isn't block-based storage from some huge vendor, or storage that ends up costing $5000+/TB (looking at you, Isilon). OK, maybe it's just me all alone here at my workplace. :)
Even running RAID6 or RAID10 over 3TB SAS drives in some largish enclosures (i.e. 36, 45, or 60 drives per 4U) will be very cost effective, even just using md, lvm, and xfs.
Or even better, use ZFS if possible. I'm not a huge fan of Solaris, but there is OpenIndiana, and FreeBSD has ZFS.
And object storage such as OpenStack Swift is really gaining momentum and will likely replace most storage systems for large amounts of data over the next two to five years. There are single orgs that have put out 5.5 petabytes of OpenStack Swift storage! Right now!
Who would you buy through? Googling for the 45 drive version gives me slightly higher pricing.
Also - the big advantage of the BackBlaze version is that people have done it before, and it's mapped out. With the SM case, there's less community. But it does look like a good solution.
It looks like the drive bays are pre-wired up, so you wouldn't need to worry about that?
What else would you need? Mboard/CPU/RAM, Raid cards, HDs and sleeves?
Like most supermicro chassis, it comes with the drive caddies, backplane, and power supplies. All you need is the motherboard, CPU, RAM, and a SAS card - well, and the drives. It's even got an expander built into the backplane. If you want h/w raid, you need to bring that as well. (I plan on using raidz2; most of the raid cards that cost less per port than the drives are not better than software raid.)
I buy most of my supermicro stuff through kingstarusa.com - I know the site looks a little shady, and you have to email for quotes for almost everything, but they are good people. My office is actually above their warehouse; I'm unit C. Their price is usually a few dollars less than provantage, which is usually the next best retailer for supermicro chassis, and they don't do shady tax dodge bullshit, and I don't have to pay shipping. I could be misremembering the exact price on the 45-bay.
Most of the 'mapping out' done with the backblaze version is already done and tested on the supermicro.
> most of the raid cards that cost less per port than the drives are not better than software raid
Let's not get carried away; isn't a "gold standard" LSI controller only ~$1,000 ($27/drive)? But you certainly can buy a lot of Intel cores for that price.
yeah, I am exaggerating on the port cost some, but the big advantage of hardware raid over software isn't the hardware calculation of parity. A CPU can calculate parity so much faster than you can write to a drive that it doesn't matter at all that special hardware can do so even faster still.
The advantage of the hardware raid card is the battery backed cache. If it doesn't have a BBU and a fair amount of cache, as far as I am concerned, you might as well be using MD.
Hardware RAID cards have improved quite a lot recently; some of the stuff now has reasonably sized caches, so perhaps I should revisit my assumptions in this area. Of course, I'm planning on using ZFS on my storage servers, so even if hardware raid cards are now a reasonably good deal, they won't do me a whole lot of good.
As much as I like ZFS, I do feel a little misled by the rhetoric about replacing "expensive" BBUs with slog SSDs... that are actually much more expensive.
Until quite recently, I would not have understood what you meant. The cost per gigabyte for even really fast SSDs is lower than the cost per gigabyte of RAID cache ram, so I'd have said "what are you on about?"
But, I think I understand what you are on about now.
Most of us (well, speaking for myself, but I think this is true of most SysAdmins) have very strong experience telling us "more read cache is better" - I mean, more read cache, up until you can cache everything the server commonly reads, makes an absolutely huge difference in performance.
So we look for big caches.
The problem is that most of us don't have the same intuitive grasp of where the benefits stop coming when adding more space to the write cache, as most of us don't have a whole lot of experience with large write-cache systems (outside of netapp/emc type boxes, and I personally attribute their superior performance in part to their gigabytes of ram that can be safely used as write-back cache.)
So if write cache works the same way as read cache? yes I will pay the premium for the fastest 32GiB SSD I can find, if I can use it all as write cache.
The thing is, I'm told, that after a few gigabytes, the returns to adding more write cache fall off sharply; and if that's true, then yeah, you are right, 'cause you are wasting most of the SSD.
I mean, the real question here is "how much write-cache do I need before I stop seeing significant benefit to adding more write-cache?" and if that number is much above what you can get in a RAID card, then the zfs/ssd setup starts looking pretty good.
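For a rough sense of scale, here's a sketch under the common rule of thumb that a ZFS slog only has to hold the synchronous writes accumulated between transaction-group commits; the 10-second window and the 2x GbE ingest rate are my assumptions, not measurements:

    TXG_WINDOW_SECONDS = 10         # generous; the default commit interval is shorter
    INGEST_MB_PER_SEC = 2 * 125     # two bonded GbE links, best case

    useful_slog_mb = TXG_WINDOW_SECONDS * INGEST_MB_PER_SEC
    print(f"~{useful_slog_mb / 1024:.1f} GB of slog is actually doing work")   # ~2.4 GB
    print("the rest of a 32 GiB SSD sits idle as far as the write path goes")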
Clearly. I'm just saying, it's much easier to design a system where hardware failures cause downtime, and where that downtime is made acceptable because hardware with redundant drives doesn't fail that often, than it is to design a system where hardware failures don't cause downtime (and where a failed hard drive causes a full node failure).
Thus, to someone who isn't at scale, the supermicro systems are likely going to be cheaper, overall, than the backblaze pods. (It sounds like these people paid more per disk for the backblaze pods than I'm going to pay for the supermicro pods anyhow.)
I've also done three of these builds so far. I used the SC847A chassis with direct iPass cable access (i.e. no SAS expanders), 4 drives per cable. The downside is you need 9x SFF-8087 connectors on the controllers, and 9 iPass cables to somehow route. Don't get the SM iPass cables; TrendNet makes better ones. The upside is you have dedicated SAS2 bandwidth from the drive all the way through the controller and the PCIe bus. Likely overkill. Also a tip: you can mount 4x internal 2.5in or 2x 3.5in drives. SM has the part numbers for the brackets on the chassis' product page. Don't put anything you'd remotely want to hot swap in these brackets; they will be buried under the motherboard tray.
I've used 4x LSI Logic 9211-8i controllers plus the onboard controller of the SM X8DTH-6F. Both the onboard and the 9211-8i use the LSI 2008 chipset. I have Solaris and ZFS sitting on top of these, so I don't use hardware raid.
This is actually very solid hardware so far. I had a PSU fail and that's it. Let me know if you have any questions.
I like the backblaze pod just in terms of using it as a comparison point, where it is about the cheapest possible storage, really. (Sure, you could probably go cheaper.)
I think they've done something that makes sense for them. It doesn't make sense for me to use it for storage where I work, but it works for them in this instance. And it seems to work for Backblaze.
Projected cost from Backblaze was $7384, for drives at $120 each and other parts sourced in quantity. A lot of the expense for these guys came from the $5400 everything-but-the-drives kit from Protocase.
The drives were more expensive because a certain model of 3TB Hitachi drive is the part Backblaze recommends; they didn't get whatever happened to be on sale at Newegg that day. The drives were supposed to be $120 each in Backblaze's quantities. They ended up finding a sales rep at CDW.com who sold them for $129 in quantity, but they were more expensive elsewhere.
The box has 48 slots, so they must use 3 TB drives. Maybe they're not completely crazy and use professional rather than desktop drives. A 3TB pro drive costs about $200.
If you're sufficiently crazy to trust your data to a Backblaze pod, you're probably going to go full retard and use desktop disks anyway. If you cared about your data you probably wouldn't be building one of these things in the first place. :)
Also: depends what you're considering "pro drives", but enterprise SATA 3TB disks are in the $250 - $300 range -- about 2x the price of a similar consumer-grade disk.
Has anyone done any research to find a difference between "desktop" drives and "enterprise/pro" drives? I'm inclined to believe it's marketing speak and targeted at the same people that buy electron-aligned speaker cable.
Actually the difference is only in firmware; each drive is tested when it reaches the end of the factory line, and the test results decide whether it'll be sold as desktop or enterprise. So physically the only difference is the label.
Different brands make the separation more or less clear: WD desktop drive firmware is explicitly crippled so as to be almost unusable in RAID arrays (on the web and on forums you'll find countless horror stories of lost arrays); Seagate and Hitachi desktop drives work about OK in RAID arrays, but you may get surprises at times.
So what's the difference? First, a desktop drive assumes it is alone. In case of a read or write error, it will retry accessing the data for several long minutes (a long timeout). Pro drives assume they are in a RAID array; in case of an error they fail almost immediately, so as not to block any outstanding IOs to the array.
Another difference is vibration compensation. Desktop drives don't use their motion sensors to compensate for drive-induced chassis vibration, which under heavy IO will significantly reduce throughput by increasing the error rate.
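As an aside: on some drives the long-timeout behavior can be capped with SCT Error Recovery Control (the mechanism behind TLER), via smartmontools. A best-effort sketch - the device list is an assumption, and many desktop drives refuse the command or forget the setting on power cycle:

    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]   # adjust to your array members
    ERC_DECISECONDS = 70                 # cap recovery at 7.0 s for reads and writes

    for dev in DEVICES:
        # show current ERC settings (or "not supported")
        subprocess.run(["smartctl", "-l", "scterc", dev])
        # try to set the cap; harmless if the drive rejects it
        subprocess.run(["smartctl", "-l", f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}", dev])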
If you (or someone you know) has a write up detailing these statements and tests showing them to be true, I'd be very interested in reading more about this. Please do share.
It sounds like what you're saying makes sense, from a drive business and manufacturing perspective.
> If you (or someone you know) has a write up detailing these statements and tests showing them to be true, I'd be very interested in reading more about this. Please do share.
This is NDA information :) I'm repeating what I've heard from drive makers, plus my own experience from setting up several thousand RAID systems.
I haven't run tests recently, but vibration can kill an array's performance. I've seen a chassis where the central drive slot (out of 24) wasn't usable because it vibrated more than the others :)
The difference between desktop and nearline SATA is basically TLER and warranty. Nearline SAS adds dual ports, maybe better ECC, and maybe better IOPS.
If you were to buy 135TB on Amazon EC2/EBS, the same capacity would cost you ~$19,000 per month, not counting charges for I/O and bandwidth.
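For what it's worth, that order of magnitude falls straight out of per-GB math; the rate below is my assumption, roughly in line with 2011-era AWS list pricing, not a quote:

    CAPACITY_GB = 135_000
    ASSUMED_RATE_PER_GB_MONTH = 0.14   # blended storage rate, assumption

    monthly = CAPACITY_GB * ASSUMED_RATE_PER_GB_MONTH
    print(f"~${monthly:,.0f}/month")   # ~$18,900, before I/O and bandwidth charges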