ZFS is the FS for Containers in Ubuntu 16.04

brendangregg · on Feb 18, 2016

This is great news, and we're already using ZFS in production on Ubuntu in a few areas at Netflix (not widespread yet).

Ubuntu 16.04 also comes with enhanced BPF, the new Linux tracing & programming framework that is builtin to the kernel, and is a huge leap forward for Linux tracing. Eg, we can start using tools like these: https://github.com/iovisor/bcc#tracing

tiffanyh · on Feb 18, 2016

What about FreeBSD?

Does that imply Netflix is transitioning away from FreeBSD? If so, why?

brendangregg · on Feb 18, 2016

Netflix cloud: AWS EC2, tens of thousands of cloud instances, mostly Ubuntu.

Netflix CDN (Open Connect Appliance): lots of physical boxes, FreeBSD.

tiffanyh · on Feb 18, 2016

Any reason why you don't use FreeBSD everywhere (or Ubuntu everywhere).

Why two different OS'es?

brendangregg · on Feb 18, 2016

It's really two questions: Why choose Ubuntu for the cloud, and, why choose FreeBSD for the CDN. We believe that's the best choice for both environments. I was trying to type in an explanation here, but that's really something that will take a lot to explain (maybe a Netflix tech blog post).

tzs · on Feb 19, 2016

If you do write that blog post, it would be cool if you not only covered the FreeBSD vs. Ubuntu aspects of the choice, but also the Ubuntu vs. other Linux aspects (particularly Debian).

vageli · on Feb 19, 2016

That would be a great post.

2trill2spill · on Feb 18, 2016

How does BPF compare to dtrace?

brendangregg · on Feb 18, 2016

If you browse some of the _example.txt files in https://github.com/iovisor/bcc/tree/master/tools , you'll see it's solving the same problems we used to solve with DTrace, plus a few extra. Here's a couple of the ZFS examples (since we're talking ZFS):

  # ./zfsslower 
  Tracing ZFS operations slower than 10 ms
  TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
  06:31:28 dd             25570  W 131072  38784     303.92 data1
  06:31:34 dd             25686  W 131072  38784     388.28 data1
  06:31:35 dd             25686  W 131072  78720     519.66 data1
  06:31:35 dd             25686  W 131072  116992    405.94 data1
  06:31:35 dd             25686  W 131072  153600    433.52 data1
  [...]

  # ./zfsdist 
  Tracing ZFS operation latency... Hit Ctrl-C to end.
  ^C

  operation = 'read'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 4479     |****************************************|
         8 -> 15         : 1028     |*********                               |
        16 -> 31         : 14       |                                        |
        32 -> 63         : 1        |                                        |
        64 -> 127        : 2        |                                        |
       128 -> 255        : 6        |                                        |
       256 -> 511        : 1        |                                        |
       512 -> 1023       : 1256     |***********                             |
      1024 -> 2047       : 9        |                                        |
      2048 -> 4095       : 1        |                                        |
      4096 -> 8191       : 2        |                                        |
  [...]

The current BPF interface we're using (bcc) is Python for the frontend, and C for the backend. It's currently much more verbose than DTrace, and involves writing 10x the lines of code. For some immediate use cases at Netflix, that's not a big problem, as staff will be using BPF via a GUI (Vector), not writing these tools directly.

There's also high level features it's still missing (like tracepoints and sampling), so what will be in Ubuntu 14.04 won't do everything, but it will do a fair amount: most of those _example.txt's. Some use a newer BPF interface (linux 4.5), and we've been putting the legacy versions in an /old directory specifically for Ubuntu 16.04 users.

cthalupa · on Feb 18, 2016

Does the current publicly released version of Vector support BPF? Or is there perhaps a PMDA that allows BPF support?

I'm following along with all of this pretty excitedly, and crossing my fingers for a Linux tracing book with BPF, ftrace, perf, etc. to read through and keep on my shelf next to your performance and dtrace books ;)

brendangregg · on Feb 18, 2016

Vector doesn't have BPF support yet. When it does we should be open sourcing it. As for a book, I'll try to get it done. :)

diptanu · on Feb 18, 2016

So good to see ZOL has come so far!

ThePhysicist · on Feb 18, 2016

ZFS is nice but as far as I understand the Linux version does not yet have support for copy-on-write clones using e.g. "cp --reflink=always", which to me was reason enough to choose BTRFS instead. Apart from this the two systems seem to be quite comparable (from my limited user perspective), with BTRFS having quite good Linux support as well. Maybe someone more experienced with the COW functionality could comment on that as it would be very interesting to hear how other people deal with this.

aidenn0 · on Feb 18, 2016

I've used both ZFS on both BSD and OpenIndiana, and I've used btrfs on linux. as recently as 4.0.5

As of 4.0.5 btrfs was IMO completely unusable as a daily file system. Some examples of issues I ran into:

1) System became unbootable with the version of btrfs I had installed and I had to use either an older or newer kernel to recover

2) I have a periodic backup of my mailbox that runs, and when it runs my system becomes completely unusable until it completes. The same script running on zfs on bsd and with ext4 or reiser3 on linux would show I/O slowdowns, but I could still use my machine.

3) In general I would run into other minor issues and the consensus in #btrfs was that since my kernel was more than 3 months old, it was probably fixed in the latest version, and why would somebody using an experimental filesystem not be tracking mainline more closely?

[edit]

To be fair, here's some issues with ZFS:

1) Do not ever accidentally enable dedupe on commodity hardware; it will slowly consume all your RAM if you aren't on a sun server (where 64GB of RAM is a resource constrained environment), and there aren't effective ways to undo dedupe, other than copying all the data onto a different pool.

2) You can't shrink a pool. Hugely annoying, apparently non-trivial to solve.

rsync · on Feb 18, 2016

"To be fair, here's some issues with ZFS:"

Let me add:

3) Do not allow a pool to exceed 90% capacity ... and probably don't let it exceed 85%.

ZFS does not have a defrag utility and it badly needs one. You can permanently wreck zpool performance by running it up past 90% capacity - even if you later reduce capacity back down to 75-80%. You can sort of fix it if you add additional top level vdevs to the zpool, thus farming out some IO to the new set of disks, but it's still going to be performance constrained forever. The only solution is to create a new pool and export the data to it.

This is unacceptable, by the way.

It is not at all reasonable to require a filesystem to stay below 80% capacity (our target "full" number at rsync.net) nor is it acceptable that hitting 90% is a (performance) death sentence.

When you consider that you might have already sacrificed 25% of your physical disk just for the raidz2/raidz3 overhead, being constrained to 80% means you're only using 60% or so of your physical disks that you bought.

ZFS needs defrag. Badly.

ryao · on Feb 19, 2016

If gang blocks are generated, you get more I/Os than necessary and an extra level of indirection, but the ZFS code base tries very hard to avoid that situation by switching to the allocator behavior at the metaslab level to best fit, such that most data written to the pool would not have gang blocks used at all. This is understandably terrible for IOPS, but it is likely the source of the performance degradation that you saw. You can probe for the zio_gang_* functions during a scrub to see if any gang blocks exist. On my pool that has exceeded 90% on multiple ocasioks, there are zero gang blocks and consequently, no permanent degradation from them. The only other problem that you might have (which tends to be caused by the order of writes rather than the fullness of the pool) is lower sequential performance from nonlinear block placement (one ZoL user measured this as cutting sequential reads in half on a pool filled with files made by bit torrent), but that is a much less severe problem, especially on solid state drives. If you want to fix placement, you can do a file copy or send/recv. The new locations should have blocks picked in sequence whenever possible.

cthalupa · on Feb 18, 2016

For ZFS to get defrag, someone with enough ability and knowledge would need to bite the bullet and go through the fairly massive and difficult task of adding in block pointer rewrite, which doesn't seem to be something anyone has been willing to do, and I've seen a lot of concerns about the actual feasibility of it from some very smart people that are knowledgeable with the codebase.

aduitsis · on Feb 18, 2016

Maybe off topic, but Alex Reece from Delphix had a blog post about a year ago about the ability to remove a disk from a pool.

http://blog.delphix.com/alex/2015/01/15/openzfs-device-remov...

This would imply some kind of rebalancing, if I understand correctly. Which maybe could be considered a form of indirect defragmentation.

XorNot · on Feb 18, 2016

Which also incidentally would solve the pool shrinking problem as well.

voltagex_ · on Feb 19, 2016

I wonder how much money you'd need to pay a very talented engineer to do the work. This is yet another thing I didn't read about before building a home NAS on ZFS.

po · on Feb 19, 2016

I had this happen to me... I accidentally allowed a pool to fill up to capacity and couldn't do anything with it because deleting files wouldn't free space due to snapshots and the commands to delete snapshots wouldn't work.

Then added a disk to it to try to recover. That worked, but only after adding a disk did I realize that I couldn't shrink the pool down again. I ended up moving the whole thing off to a new disk cluster and back again. Really painful.

aidenn0 · on Feb 18, 2016

That's a good point about the 90% capacity.

Also the same "technically challenging" problem that needs to be solved for shrinking pools would also make defrag possible.

l1ambda · on Feb 18, 2016

The Reiser guys were talking about this too. I think they needed defrag and were trying to get DARPA money to implement it, but it never happened.

binarycrusader · on Feb 19, 2016

Not all of the performance issues encountered when a pool exceeds 90% capacity are related to fragmentation.

Some implementers have been able to significantly reduce the pain of those situations.

ryao · on Feb 19, 2016

The main factor seems to be the best-fit allocator, which tends to go into action sooner from metaslabs crossing the 94% threshold earlier than the pool itself and still be selected due to LBA weighting, which is a trick to increase throughput on rotational media. This ought to help prevent best-fit allocation from occurring earlier than necessary on SSDs:

https://github.com/zfsonlinux/zfs/commit/fb40095f5f0853946f8...

That said, Delphix made changes to consider on fragmentation rather than space usage when the histogram is in use, so the performance before best-fit behavior goes into effect is better than outright selection of metaslabs by free space.

riquito · on Feb 18, 2016

I've been using btrfs since 2 years (Fedora) and the only problem I've ever run into was "no space left on device", solvable with a rebalance. btrfs also survived many hard poweroff.

zanny · on Feb 18, 2016

I'll pitch in with a neutral position. I've been using btrfs for four years, and in that time I've had unrecoverable fs corruption probably three times. This is on Arch, on bleeding edge kernels, where new releases are prone to regressions that break the filesystem.

But there was a tangible progression from instability towards increasing stability. I haven't had one lick of issue with btrfs in about a dozen kernel releases. I'm close to saying I'd trust in a production environment, since I use it everywhere else as a daily driver, I would just use an LTS release version just to be safe.

It is not all sunshine and roses, though. While Facebook employs several major btrfs developers, a lot of features that have been talked about for years still have not seen the light of day or any development whatsoever. lz4 compression, better checksum algorithms, per-subvolume encryption, online filesystem checking, and the Raid 5/6 support is still kind of garbage a year later. I worry that btrfs is suffering from a lack of interest in actually making the last legitimate pushes it needs, and code audits, and integration testing, to make it truly trustworthy.

But at the end of the day checksum integrity and COW are basically a game changer for me in terms of data integrity.

feld · on Feb 18, 2016

I tried to use the mirror functionality when it was new. I tested booting with one disk missing. Errors all the way. I went into IRC and chatted with btrfs folk about the "bug". Their response?

"Booting without all members of your mirror is unsupported."

dsr_ · on Feb 18, 2016

It's certainly supported now. mount -o degraded. It ought to be the default, though.

feld · on Feb 19, 2016

if it's not default yet it's a serious bug. a server should still come up completely with one disk in a mirror.

nisa · on Feb 18, 2016

If you use ZFS you only have snapshots for this. cp --reflink also has some gotchas on btrfs - if you do a balance it's not preserved and there are some odd problems with snaphots (I have no link at hand, take a look at the mailing list)

ZFS is not comparable to btrfs at the moment. Everything device related is missing on btrfs. No detection of missing or broken devices in btrfs at the moment, no hotspare functionality, btrfs RAID1 uses the pid to decide with disk to read from. RAID5/6 is still experimental and there are some odd behaviours.

Using btrfs for production is a risky bet and may very well bite you. The tooling is terrible at the moment (IMHO) and benchmarks favour ZFS most of the time.

ryao · on Feb 18, 2016

No ZFS version has support for that. It has been discussed and it might be implemented at some point.

That said, I would like to point out that ZFS' dataset level operations are more powerful than reflinks. ZFS' dataset level operations give separate independent snapshot and clone capabilities. They also provide the ability to rollback without killing things on top (which is useful in some cases). You cannot do that with reflinks. I suppose the immutable bit could be used to fix a reflink so that it retains the state at creation, but that is racy. In the case of virtual machines which seems to be a major application of reflinks, zvols are lower overhead and support incremental send/recv.

One benefit of reflinks would be that regular users can use it, but regular users should be able to snapshot, clone and rollback when delegation support is implemented.

barrkel · on Feb 18, 2016

Running hourly, daily, weekly and monthly snapshots is reason enough to choose ZFS on Linux for me. And ironically, it runs better on Linux than it did for me on Solaris - I used to get occasional pauses every few minutes when streaming media. Memory utilization for cache purposes isn't fantastic since it doesn't interact well with the rest of the kernel's logic, but everything else is pretty good.

baldfat · on Feb 18, 2016

I don't understand why BTRFS isn't gaining more support in Linux. I prefer BTRFS to ZFS with Fuse.

zzzzzthrowaway · on Feb 18, 2016

Probably rhymes with "btrfs balance".

I'v had the misfortune of using btrfs in production with a few hundred machines on Ubuntu 14.04. It's one of the most finicky filesystems I've ever used. It's probably better in newer kernels, but if you have a lot of churn it requires constant care and feeding and tends to cause kernel softlockups fairly commonly.

Also this: http://blog.pgaddict.com/posts/friends-dont-let-friends-use-... and this: https://www.phoronix.com/scan.php?page=news_item&px=CoreOS-B... and this (old but a few of the points still apply): http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-...

That being said I'm sure there are plenty of use cases for which btrfs in its current capacity is more than sufficient.

robszumski · on Feb 18, 2016

At CoreOS and we tried really hard to make btrfs happen, but it really came down to how different it operated than other file systems. It was mainly a UX issue and thus fell into my lap.

The major issue is that regular debugging tools that folks have been using forever like `df -h` aren't just non-functional, they actively misrepresent the state of the file system. The most common example is indicating that you have plenty of free space when in fact you're out. We had to write a lot of documentation to teach people how btrfs works and how to debug it: https://coreos.com/os/docs/latest/btrfs-troubleshooting.html

The second major issue is that rebalancing requires free space, which is the problem that most folks are trying to fix with a rebalance operation. Catch-22 in the worst way. Containers vary in size and can restart frequently, churning through the btrfs chunks without filling them up, leaving around a lot of empty space that needs to be rebalanced.

ryao · on Feb 19, 2016

I hit that rebalancing needs space (and therefore can ENOSPC) issue at work when trying to compile ZFS on CoreOS on Digital Ocean before CoreOS switched from btrfs to ext4 and overlayfs. Getting ENOSPC on btrfs rebalance when you are seeing regular writes return ENOSPC is a really annoying problem.

nisa · on Feb 18, 2016

> I'v had the misfortune of using btrfs in production with a few hundred machines on Ubuntu 14.04

You are not alone. btrfs seems to be kind of stable - as in does not corrupt itself anymore - with 4.2 but it's been a nasty ride.

It's an experimental filesystem that is neither complete nor stable yet. I wish this would be better communicated.

It's needless frustrating: If you search for btrfs you come across a few slide decks that tell you: It's fine you can use it... after the first strange problems you'll subscribe the mailing list and every other day there is some post that shines some light into strange behavior and stuff that is not implemented.

If you want checksumming on your single hdd backup disk btrfs is fine. For everything else you are up to some surprises... basically everything volume management and RAID is pretty much experimental and has strange behavior.

Performance is not even a topic. I remember the ML discussion on this OLTP blogpost and the majority of responses was: Don't run databases on btrfs you stupid! I'd rather would read a technical discussion about the problems but from reading the ML it seems like it's too complex and few understand the complexity.

@bcantrill called it a shit-show in some podcast and while it maybe technically not true it sure does looks like that.

If you want peace of mind use mdraid+ext4 (or xfs) - ZFS on Linux has a lot of problems for heavy usage but the community is IMHO more invested in making it a good Linux citizen.

On the other hand: This stuff is complicated and everyone expects miracles. I'm just looking at it from sysadmin perspective and on Linux both suck at the moment. But ZFS won't eat your data and has far better tooling.

If you need something that works for high load on Linux I'd use neither.

wstrange · on Feb 18, 2016

This isn't ZFS in Fuse - it is a native kernel module.

BTRFS has yet to reach the stability of ZFS.

jlappi · on Feb 18, 2016

I wonder if the licensing of the ZFS kernel module will cause any issues with its inclusion in Ubuntu?

l1ambda · on Feb 18, 2016

Canonical's lawyers OK'd it which is why they are including it.

nailer · on Feb 18, 2016

Weird. What changed in the last 10+ years?

5ilv3r · on Feb 19, 2016

Nothing. It's the same as nvidia's non-gpl kernel modules. The simple fact is they will never be accepted upstream, but that matters little to distributors of ubuntu's scale.

ryao · on Feb 19, 2016

In the case of Nvidia's modules, Nvidia's proprietary licensing disallows distribution of a prebuilt nvidia.ko (as that implies distributing a modified version). Coincidentally, their license terms for the OpenSolaris driver have no such restriction and the OpenSolaris descendants distribute the prebuilt module without potentially violating Nvidia's license terms.

Amazingly, their Linux licensing used to be worse. They used to claim you were only permitted to install the driver on one computer within an organization

stubish · on Feb 19, 2016

Someone hired better lawyers. All these ridiculous EULA and click through licences and idiotic mandatory registration systems we see, I can't help think many companies would benefit from hiring better lawyers. Get rid of the timid who default to 'no' in order to protect their own arse, hire people who help you get where you want to go.

In this case, I can't even see any real liability issues - even if Canonical did get taken to court there are no damages since the software is free of charge.

nailer · on Feb 19, 2016

> no damages since the software is free of charge

Many people can and do charge a lot of money for OSS products.

ryao · on Feb 19, 2016

People decided to listen to lawyers who read the licenses instead of listening to statements by people claiming to know how things work without actually reading either license or asking a lawyer about it. That is quite literally the only change.

nailer · on Feb 20, 2016

Didn't Sun use lawyers to design the CDDL to precisely prevent this situation?

e12e · on Feb 23, 2016

No, I've seen talks by the engineers behind Solaris (I don't recall who at the moment) that strongly indicated the Sun lawyers didn't go out of their way to be incompatible with the GPL -- they just wanted a license that allowed them to split proprietary and open code (as they didn't have the right to open up all of Solaris due to licensing agreements with third parties etc) -- and still being able to distribute both open and traditional closed Solaris. This led to the "per file" license nature of the CDDL -- and unfortunately to the "additional limitation"-bit that makes it incompatible with the GPL.

If it was done again today, they might have gone for the Apache license as I recall -- and avoided some of the unfortunate issues.

jvoorhis · on Feb 19, 2016

Pretty sure Canonical saw an opportunity with container management, plus interest in ZFS, decided to get over the unfortunate licensing issue and support the module.

usefulcat · on Feb 18, 2016

I check in on btrfs every 6-12 months or so. To date it has always seemed too unreliable compared to ZFS. Lack of decent RAID 5/6 support is another major difference.

plus · on Feb 18, 2016

Ubuntu 16.04 will have ZFS support in the kernel, via a kernel module. They are not using FUSE.

rodgerd · on Feb 18, 2016

I have been using btrfs for a couple of years. The nicest thing I can say about it is that it is both conceptually elegant and has made my backup practises bulletproof.

XorNot · on Feb 18, 2016

ZFS on Linux runs natively though. This is not the fuse version.

TallGuyShort · on Feb 18, 2016

Granted, it's been a couple of years, but I tried BTRFS when openSUSE made it their default and I had filesystem issues that I've never had on anything else. I'm sure it's progressed a lot since then, but I expect it will be behind ZFS for a long time in terms of stability.

ryao · on Feb 19, 2016

SUSE seems to recommend XFS for production data and btrfs for the rootfs that is a read mostly workload that is unlikely to trigger ENOSPC. It is not what I would consider a great endorsement of btrfs.

barrkel · on Feb 18, 2016

ZFS on Linux (i.e. kernel module, not with fuse) is very stable in my experience. I have not heard the same about btrfs. I tried ZFS on fuse once, but the performance was abysmal.

ryao · on Feb 19, 2016

ZFSOnLinux does not use FUSE.

melted · on Feb 18, 2016

Apple should replace the abomination that is HFS+. They almost did a few years ago, but pulled the code at the last moment.

e1ven · on Feb 18, 2016

Apple's choice to drop ZFS may have been influenced by the lawsuit Sun was dealing with at the time.

http://www.computerworld.com/article/2539287/data-center/app...

I'd wager they wouldn't want to adopt ZFS without an explicit license and legal indemnity from Oracle.

ryao · on Feb 19, 2016

I heard from a former Sun executive that Apple wanted indemnification following netapp's lawsuit. They had spent a long time negotiating over that before they had something mutually acceptable and it was supposed to be signed the day Oracle's acquisition of Sun finished. That left it up to Larry Elison, who refused to sign it and Apple decided to try its luck improving HFS+.

zanny · on Feb 18, 2016

They could adopt btrfs. Darwin is open source, right?

melted · on Feb 18, 2016

BTRFS is still not mature, and there's a license incompatibility as well. And it's controlled by Oracle all the same. If Apple replaces the filesystem they'll likely roll their own.

zanny · on Feb 18, 2016

btrfs is not controlled by Oracle for one (its principal developers are employed by Facebook, but its still regular Linux GPLv2 code) but I did check and the APL is incompatible with the GPL.

And they obviously aren't going to make a new filesystem. That doesn't get them sales like higher resolution screens or changing the color theme... again.

ghaff · on Feb 19, 2016

And they obviously aren't going to make a new filesystem. That doesn't get them sales like higher resolution screens or changing the color theme... again.

Introducing a new filesystem would be a big decision for Apple. There would doubtless be all sorts of migration and compatibility issue, even aside from the work it would take. Especially given where we are in the maturity of desktop clients, it makes a lot more sense to incrementally improve the current filesystem. I'm not sure how snarky you intended to be, but no there aren't many sales in a complex undertaking that is far more likely to cause data corruption and migration issues than concrete benefits for 99.9% of users.

melted · on Feb 19, 2016

Looks like you're right. Oracle used to control it (as evidenced by (c) Oracle all over the place), but that doesn't seem to be the case anymore.

ghaff · on Feb 19, 2016

I'm not sure it's accurate to say it was "controlled" by Oracle but a lot of--though certainly not all--active development came out of Oracle at one point, notably by Chris Mason (who is now with Facebook).

melted · on Feb 19, 2016

Whoever employs the core team, controls the project. For a while that was Oracle.

beagle3 · on Feb 19, 2016

Dragonfly BSD's HAMMER2, when it is even half done (that is, stable for one node), is probably a much better (technical) choice than BTRFS or improving HFS+, and probably a much better legal choice than ZFS.

mark_l_watson · on Feb 18, 2016

Sorry in advance if this is a stupid question: my main Linux system is a laptop with a small SSD drive. I would like to organize my entire digital life on a 2 TB external USB drive, and be able to maintain a clone of everything on at least one other 2 TB USB drive.

Is ZFS the right tool for this?

berdario · on Feb 19, 2016

I read some time ago that ZFS is definitely NOT the right tool for laptop/external storage, unless you actually have a zpool with mirroring/raidz (which means you have to always keep the devices connected).

The reason is that when ZFS detects corruption, it'll lock down the whole fs... and prevent reading/recovery data from it, as recovering data from raidz is the expected solution in that case.

I tried to google again for the description of this issue, but I couldn't find it... I found this otoh:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs...

And things aren't that obvious, apparently:

> Even without redundancy and "zfs set copies", ZFS stores two copies of all metadata far apart on the disk, and three copies of really important zpool wide metadata.

Whichn means that this might not actually be a problem after all

ryao · on Feb 20, 2016

> The reason is that when ZFS detects corruption, it'll lock down the whole fs... and prevent reading/recovery data from it, as recovering data from raidz is the expected solution in that case.

ZFS has duplicate metadata by default, so it can recover from corrupted metadata blocks unless too much is gone. If the data blocks are corrupted and there is no resundancy, you should get EIO. There is no code to "lock down the FS", although if you have severe damage (like zeroing all copies of important metadata or losing all members of a mirror), it will die and you will see faulted status. That is a situation no storage stack can survive and is why people should have backups.

Freaky · on Feb 19, 2016

> The reason is that when ZFS detects corruption, it'll lock down the whole fs... and prevent reading/recovery data from it

Depends on what exactly is corrupt, but for file corruption it's generally just a case of warnings in logs/zpool status (which will suggest restoring the file from backup), and IO errors trying to access that specific file. The pool itself should remain intact and online.

It's less clear cut if it's important metadata that's damaged, but as you mention, ZFS is quite aggressive about maintaining multiple copies even on standalone devices.

ryao · on Feb 20, 2016

I have backups stored on a double mirror of USB drives. The USB interface is fragile, but it does work. I cannot say that I recommend USB drives though, but if you are using USB, ZFS is not at any disadvantage versus other filesystems.

For what it is worth, I am this ryao:

https://github.com/zfsonlinux/zfs/graphs/contributors

lmm · on Feb 18, 2016

If you're talking about a "static" setup where you attach both at the same time or not at all, yes. ZFS export before unplugging, ZFS import when plugged in, I can see it working very nicely.

If you're talking about using one of them most of the time and syncing occasionally then any filesystem will do, you'll want a user-level tool for doing the sync (probably - sibling did mention zfs send which I don't have any experience with).

dsr_ · on Feb 18, 2016

It's a tool that could work. You can use zfs send/receive to clone a zfs filesystem or snapshot.

You could also use rsync, duplicity, or a bunch of other tools.

The major zfs advantage here is that all your files get integrity checking.

nisa · on Feb 18, 2016

Let's see how this works out. It's probably better and more stable than btrfs but this is not complicated...

ZFS on Linux had issues with ARC (especially fast reclaim) and some deadlocks and AFAIK cgroups are not really supported - e.g. blkio throttling does not work.

Would be great is they got this ironed out but I would be wary. Still great news!

AstralStorm · on Feb 18, 2016

Additional problem is that in-kernel latencies of both btrfs and ZFS are on the high end. Essentially a show stopper for professional audio work and maybe some kinds of video streaming. Trying to completely escape disk IO in those uses is very limiting.

A comparable solution using LVM and/or mdraid with ext4 on top has much better latency behavior.

Sorry for no benches for you, but feel free to run a quick check using latencytop and ftrace. Phoronix has some performance comparisons if you want them.

semi-extrinsic · on Feb 18, 2016

> Essentially a show stopper for professional audio work (...). Trying to completely escape disk IO in those uses is very limiting.

Could you expand on that?

I mean, an hour of mono uncompressed 192 kHz/24bit audio is almost exactly 2 GB. Compared to professional audio equipment, 128 GB of RAM isn't very expensive ( < $2000), and that would let you keep 64 one-hour maximum-def tracks in memory. Why do you need to read from the disk with any frequency?

shmerl · on Feb 18, 2016

OpenZFS, not ZFS. Those are different beasts now.

ZenoArrow · on Feb 18, 2016

Worth pointing out the licence is the same for both.

shmerl · on Feb 18, 2016

Not at all. ZFS is not even FOSS anymore.

ZenoArrow · on Feb 18, 2016

Fair enough.

heavenlyhash · on Feb 18, 2016

This is great news. Among other incentives, ZFS has some truly excellent features for improving reliability. ZFS's built-in checksums, for example, can result in much happier stories during the onset of disk failures: where a RAID array can quietly return incorrect sector contents without noticing, and be unable to correctly differentiate between the correct and not-so-correct sectors in the event of disk loss followed by disagreements discovered during rebuilds, ZFS simply does the right thing by making checks during normal operations, and uses the same checks to confidently do the right thing during recovery. And snapshotting. Oh, snapshotting.

On the other hand, I've always wished we could get a modern re-take on ZFS. As anyone who's tried it will tell you: dedup in ZFS essentially doesn't work. ZFS, internally, is not built on content-addressable storage (or, it is, but since splitting of large files into blocks doesn't take any special actions to make similar blocks align perfectly, it doesn't have anywhere near the punch that it should). As a result, dedup operations that should be constant-time and zero memory overhead... aren't. Amazing though ZFS is, we've learned a lot about designing distributed and CAS storage since that groundwork was laid in ZFS. A new system that gets this right at heart would be monumental.

Transporting snapshots (e.g. to other systems for backups... or to "resume" them (think pairing with CRIU containers)) could similarly be so much more powerful if only ZFS (or subsequent systems) can get content-addressable right on the same level that e.g. git does. `zfs send` can transport snapshots across the network to other storage pools -- amazing, right? It even has an incremental mode -- magic! In theory, this should be just like `git push` and `git fetch`: I should even be able to have, say n=3 machines, and have them all push snapshots of their filesystems to each other, and it should all dedup, right? And yet... as far as I can tell [1], the entire system is a footgun. Many operations break the ability receive incremental updates; if you thought you could make things topology agnostic... Mmm, may the force be with you.

[1] https://gist.github.com/heavenlyhash/109b0b18df65579b498b -- These were my research notes on what kind of snapshot operations work, how they transport, etc. If you try to build anything using zfs send/recv, you may find these useful... and if anyone can find a shuffle of these commands with better outcomes I'd love to hear about it of course :)

ryao · on Feb 19, 2016

The deduplication code works, but each deduplication operation requires 3 serial IOS to lookup the information needed to check if deduplication is possible and if they are in aches, that becomes painful fast on storage with low IOPS. On my workstation where I have enough memory that the results of all of the lookups naturally fit and high IOPS storage, the deduplication code runs well. You would have a similar problem designing a system that perfectly deduplicates data at the record level if you tried.

infogulch · on Feb 19, 2016

I was thinking about this. To reduce both the huge ram usage and serial IOs you could use something similar to a Bloom filter to quickly test whether you should attempt to dedup a new block. If the filter says it's not a duplicate, then completely skip the standard (slow) dedup path.

Bloom filters specifically have issues: they don't permit removing entries for one, and they're not really that efficient. But there's a paper about Cuckoo Filters which seems to solve both of these problems. For example:

    The "semi-sort" variant of the cuckoo filter benchmarked in the paper has a size of 192 MB and holds 128M items.
    So for 8kb blocks, it can dedup 1TB of blocks. More if you increase the block size or the size of the table.
    It has a 0.09% false-positive rate (!). I.e. unique blocks would use the slow path to test for duplication in vain only once in 1111 writes.
    The algorithm can perform 6 million lookups per second on the benchmark hardware. (2x Xeons at 2.27GHz, 12MB L3, 32 GB DRAM)

This is assuming that the majority of writes are actually unique, and dedup is more of a "it would be nice" thing than essential. But for that case something like this would be a lot easier to implement and use far fewer resources. Just stick it in front of the existing dedup lookup and early-exit if the filter says it's not duplicate.

[0]: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

bryanlarsen · on Feb 18, 2016

Note that ZFS isn't magic. Even with ZFS's checksums on read, you should still be doing regular scrubs, just like you should be with LVM or btrfs. And once you have regular scrubs, checksums on read don't really add much.

heavenlyhash · on Feb 18, 2016

Agreed, turning on more safety features will always make you... safer :)

But it's worth noting that I've debugged corruptions in prod systems where:

- corrupted data was read from disk -- a bit flip, with no error code at the time -- by an application

- the application operated on it

- and the application then wrote the result -- still carrying the bit flip -- back to a new file on disk.

Ouch. The bitflip is now baked in and even looks like a legit block as far as the disk is concerned. The disk failed not long after, of course -- SMART status caught up, etc. But that was days later.

Checksums on read address this. I never want to run a system without them again.

bryanlarsen · on Feb 19, 2016

I don't understand. If the bit was flipped before or during read the scrub would catch it. If it was flipped by the application then no file system can help. How does read checksums help you?

mrighele · on Feb 18, 2016

There is still the chance that data get corrupted between the time the scrub is performed and the time you read the data, so I don't consider scrubs sufficient. In any case you are right, they should be performed even with ZFS, expecially to test data that is rarely or never read back.

bryanlarsen · on Feb 19, 2016

Sure, there can be one error between scrub and read. But assuming RAID, you need errors on two disks. That can happen in a week or however often you scrub, but that's going to be pretty low probability.

heavenlyhash · on Feb 25, 2016

You assume that your RAID implementation is going to actually read all of its parity bits from each of the disks, and check them for agreement, before returning a value to you.

You may want to validate that assumption. :)

(Prepare for an unpleasant surprise.)

weitzj · on Feb 18, 2016

And what about machines without ECC RAM? I thought this is the idea for using ZFS in the first place. Or is the ECC "requirement" only important for raidz?

takeda · on Feb 18, 2016

The whole ECC is required by ZFS is a bit misunderstanding.

ZFS puts guarantees that your data will be safe, but if has no power to help you if your data gets corrupted in memory. The ECC is the last piece needed to guarantee data safety.

So if you don't have ECC your data is still safer with ZFS than traditional file-systems, ECC just increases the safety further.

duaneb · on Feb 19, 2016

How does corrupted memory affect ZFS's performance? Much of the replication state is stored in memory; is it possible you could lose data from a single bit being flipped?

lomnakkus · on Feb 19, 2016

AFAIUI, yes. If a single bit in the root's checksum is flipped just before being written to disk you'll probably lose the whole file system.

empthought · on Feb 19, 2016

You understand incorrectly.

http://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=2...

http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-yo...

http://blog.brianmoses.net/2014/03/why-i-chose-non-ecc-ram-f...

etc.

lomnakkus · on Feb 19, 2016

Could you be more specific? You seem to just be linking to random posts about ECC vs. non-ECC. I don't see anything specifically there about the root of the file system.

(I'll happily grant that this scenario is so unlikely as to be impossible for all practical purposes, but having skimmed the stuff you linked to I don't see why it couldn't happen theoretically.)

ryao · on Feb 20, 2016

If I were creative, I could probably kill any filesystem with 1 bit flip if I could control the bit.

Filligree · on Feb 18, 2016

Without ECC RAM, you're far more likely to get uncorrectable / unnoticeable corruption.

This is not unique to ZFS, and it doesn't make ZFS worse than other filesystems. But since the reason you'd use ZFS is often to avoid any corruption, it's tradition to advise the use of ECC.

wstrange · on Feb 18, 2016

I learned this the hard way on an old server that did not have ECC.

I had a file server happily ticking away using ext4.

Converted it to ZFS - and a week later got file system corruption reported. Ran a very extensive memory test - and sure enough I had bad RAM (but it took 2-3 days for the errors to show up).

In the wild there has to be a ton of corruption that just never gets discovered without end to end checking.

If you have a large jpgs or MKVs - a flipped bit here or there is not going to be apparent.

ptaipale · on Feb 18, 2016

Why would ZFS be worse than ext4 or anything in this way if you don't have ECC?

Genuine question, I don't understand this claim. As far as I can see, ZFS provides protection against some types of failures on disk, which ext4 doesn't. ECC has no impact on that, it protects another dimension.

ben0x539 · on Feb 18, 2016

I read this as they've been having corruption happening all along, but only ZFS detected it?

wstrange · on Feb 19, 2016

Yes - exactly.

To be clear, ext4 did not cause the corruption - but neither did it detect it or correct it. It happily sent corrupt blocks out to disk.

I think this happens a lot more frequently than people realize.

FullyFunctional · on Feb 19, 2016

One reason is that you generally perform scrubbing, thus you are potentially rewriting data which would otherwise be at rest. If your memory is bad, this could replace good data with bad. FS that doesn't scrub doesn't have this issue.

tobias3 · on Feb 18, 2016

No it is worse with ZFS because it doesn't have a fsck tool. If you have a bit flip in the ZFS metadata you have to export and re-import your whole pool to get it to a writable state again.

Freaky · on Feb 18, 2016

Meanwhile fsck on a traditional filesystem will gleefully mangle data that's actually fine in face of transient corruptions.

I've had this happen more than once, both with bad RAM and bad IO controllers - previously fine static data suddenly being detached from the filesystem and appearing in little bits in lost+found, because bit flips effectively causes it to hallucinate problems to "fix".

ryao · on Feb 20, 2016

This is wrong. But flips can do many different things:

http://open-zfs.org/w/index.php?title=Hardware&mobileaction=...

Export/import is not a solution to most of those. In the cases where it does work, umount/mount is likely all that is needed.

cthalupa · on Feb 18, 2016

http://open-zfs.org/wiki/Hardware#Background

This would seem to indicate that this sort of thing is exceedingly rare and similar issues would have similar effects on other filesystems.

pmarreck · on Feb 19, 2016

Resilvering (which is basically a global data verify, similar to fsck) will fix bits flipped the wrong way via error correction, assuming you've set ZFS up that way. Are you saying this doesn't apply to the metadata?

ryao · on Feb 20, 2016

A simple scrub will repair blocks that have checksum failures, but there's is no guarantee that the checksum was calculated before the bit flipped if it occurred in a buffer being written.

heeen2 · on Feb 18, 2016

What is scrubbing and resilvering if not a fsck equivalent?

tobias3 · on Feb 18, 2016

Scrubbing corrects on disk bit flips. An in-memory bit flip (which is more rare than on disk bit flips even with non-ECC memory) can corrupt a in-memory data structure which is later written to disk to all replicas, i.e. scrub will not detect it. If this corrupted data structure is later loaded and used this may cause all kinds of problems and there is no tooling to correct it.

michaelmrose · on Feb 19, 2016

Scrub?

frik · on Feb 18, 2016

So is ZFS more prone to defects without ECC than let's say EXT4 or NTFS?

I would like to buy a notebook with ECC ram, but Intel doesn't care.

usefulcat · on Feb 18, 2016

No, no one is saying that ZFS is more prone to defects without ECC. Lack of ECC increases the risk of corruption for any filesystem. The reason you hear more about ECC in the context of ZFS is that data integrity is a key feature for many who choose to use ZFS.

takeda · on Feb 18, 2016

I don't understand why this is so often misunderstood.

The parent poster already stated the opposite.

ECC and ZFS are orthogonal. ECC ensures that data in your RAM is not corrupted (or rather detects corruption) it helps whether you use ZFS, EXT4, NTFS etc.

ZFS increases your data safety whether you use ECC or not, but if you have to have maximum assurance that data is fine you should use ZFS and ECC.

empthought · on Feb 19, 2016

As far as I can tell, the ECC/ZFS paranoia traces back to "cyberjock" on the FreeNAS forums. https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-...

ryao · on Feb 20, 2016

This is correct. He considers ECC so important that he is willing to spread FUD about non-ECC behavior to try to scare people into using ECC. I think the truth is scary enough to convince people, but he does not agree.

RotsiserMho · on Feb 18, 2016

No, all are equally prone to errors if a bit in RAM is unexpectedly flipped. My understanding is that ZFS requires more RAM and possibly more CPU than other filesystems and those costs aren't worth it if you're going to use RAM that can't detect errors anyway.

X-Istence · on Feb 18, 2016

To answer your question: No.

They are all just as prone to defects.

Sanddancer · on Feb 18, 2016

They do. There are now Xeon E3 notebooks that have ECC. It's something new they added with Skylake for that reason. They're designed as mobile workstations, so you're gonna be looking at at least $1700 for a laptop, but good quality's always expensive.

ryao · on Feb 19, 2016

ECC is just as important for EXT4 and NTFS as it is for ZFS.

zanny · on Feb 18, 2016

http://blog.codinghorror.com/to-ecc-or-not-to-ecc/

At least Jeff doesn't believe much in ECC anymore.

ryao · on Feb 20, 2016

ECC is a prequisite for any filesystem to behave as intended. ZFS needs ECC just as much as other filesystem does. e.g. ext4, XFS, etc.

e40 · on Feb 18, 2016

Anyone know of Red Hat has plans to include it in default installations?

cjbprime · on Feb 18, 2016

Opinion: Red Hat won't because it has better lawyers.

rwmj · on Feb 18, 2016

... and billions more dollars at stake.

wstrange · on Feb 18, 2016

Oracle could put this whole legal wrangling to rest by changing the license to GPL.

I am dreaming I suppose.

protomyth · on Feb 18, 2016

They don't own the copyright on the patches, so it wouldn't exactly be effective.

dmm · on Feb 18, 2016

They might not own the openzfs patches but they definitely own all of the zfs code. They had all contributors sign CLAs to reassign copyrights. That's how they were able to end opensolaris.

protomyth · on Feb 18, 2016

I'm pretty sure ZFS as last released is not as valuable without the OpenZFS contributions.

ryao · on Feb 20, 2016

This is correct.

cthalupa · on Feb 18, 2016

From my understanding, Sun back in the day, and Oracle now, cannot release everything under the GPL due to contractual obligations. There's a reason they wrote the CDDL in the first place.

I can't find the talks now, but I believe Cantrill and others have spoken about this previously.

My memory is somewhat fuzzy, so I might be wrong on this.

ryao · on Feb 20, 2016

That was Solaris and that was regarding putting all of it under an OSS license, not necessarily the GPL. There were a few tiny bits that they just could not open source.

Oracle could release their fork of ZFS under any license they wish.

bsharitt · on Feb 18, 2016

I'm pretty surprised we haven't seen an official Oracle version of ZFS released with Oracle Linux.

wstrange · on Feb 18, 2016

I think there was an internal schism at Oracle regarding BTRFS vs. ZFS.

The Solaris fans (generally from Sun) preferring / promoting ZFS, and hard core Linux fans (mostly Oracle) - who tended towards BTRFS.

bsharitt · on Feb 18, 2016

That's right, I forgot, btrfs began as an Oracle project didn't it?

dekhn · on Feb 18, 2016

I'm a conservative user, so I don't change my filesystem until my preferred distribution (Ubuntu) supports installing an FS to the root with a provided, supported kernel module. This is a huge deal for me; I will probably install a new FS on my main file server and move from ext4 to zfs.

xenophonf · on Feb 18, 2016

In case you ever feel less conservative, here's how I set up my MacBook Pro to dual boot Mac OS X/FileVault and Ubuntu/ZFS/LUKS:

https://gist.github.com/xenophonf/2e2d1a1550b0fb8dae98

acd · on Feb 18, 2016

ZFS is great, happy its included in Ubuntu 16.04 by default.

obelisk_ · on Feb 18, 2016

Anyone know if this applies for Lubuntu as well? I use Lubuntu on my rencently bought new desktop for the default of LXDE. I intend to upgrade my desktop on which I put Lubuntu 15.10 (I bought a computer without OS so as not to pay the Windows tax) to Lubuntu 16.04 because I understand it'll be based on LXQt, the successor to LXDE (and it's not a matter of newer is better -- I am a fan of Qt) and also because I think Lubuntu 16.04 will be an LTS release and I've been very happy with the stability of Lubuntu 14.04 LTS which my previous main computer was and is running.

jcastro · on Feb 18, 2016

Yes, all Ubuntu derivatives use the same repository so you should be set.

obelisk_ · on Feb 19, 2016

Thanks :)

jgworks · on Feb 18, 2016

Does this include a fix for this bug?

https://github.com/zfsonlinux/zfs/issues/1548

faragon · on Feb 18, 2016

Check if this works for your case:

echo > myfile

P.S. not the "solution", but may help in case you fill the disk by accident, and need to make some room before a remount cycle. P.S.2. this could help also when your disk is already 100% full, without enough space for deleting files -not enough space for new inodes- (I tested that case on ZFS NAS with no left space at all, and worked).

ansible · on Feb 18, 2016

Huh, well that's interesting news.

I'm currently setting up a couple servers using LXC with btrfs.

I ending up choosing LXC (as opposed to LXD, docker, rkt, etc.) because I wanted something relatively straight-forward. I just wanted some containers I could create, log in to and configure.

If this was a bigger deployment, I'd take the time to use docker or something else. But for now, just being able to get going quickly is nice. For backup / failover, I can btrfs send / receive the containers to another host and start them there.

stubish · on Feb 19, 2016

Yeah, thats all fine and good. Nothing to announce with lxc/lxd + btrfs because it already works fine :) I do like the wizard to easily setup the zfs backend, rather than you needing to manually replace /var/lib/lxc with a btrfs partition or however you are doing it.

I've been using lxc + btrfs daily for quite a while, setting up and tearing down hundreds of containers on a busy day. I stopped using lxc snapshots after I had a btfs subvolume that would crash the system when I mounted it. After that, no problems.

ansible · on Feb 19, 2016

I stopped using lxc snapshots after I had a btfs subvolume that would crash the system when I mounted it. After that, no problems.

That's unfortunate. What operating system version were you using at the time?

I've actually switched to using Ubuntu 15.10 for the container hosts so that I can get a more recent version of the btrfs tools. The intention is to upgrade to 16.04 as soon as is reasonable, and leave them there for a long time.

baldfat · on Feb 18, 2016

BTRFS uses less memory so I prefer BTRFS to ZFS for containers.

ryao · on Feb 20, 2016

How did you determine that? The two should use about the same amount of memory. The only diffence is that ZFS uses ARC and btrfs uses the page cache. ARC is not reported the same way as the page cache though, which might give the appearance of requiring more.

wmf · on Feb 18, 2016

This post only mentions LXD. Will ZFS be used for Docker?

klapinat0r · on Feb 18, 2016

Whether docker in aptitude will be configured to use zfs by default is impossible to guess. It is however possible to say that ZFS is already supported by docker, e.g.: https://docs.docker.com/engine/userguide/storagedriver/zfs-d...

awalton · on Feb 18, 2016

Docker switched to Alpine Linux by default, so there's that hurdle... Not to mention the giant legal question mark of loading CDDL code into a GPLv2 kernel.

_euvw · on Feb 18, 2016

One of the ideas of containers / virtualization is that the host operating system (Ubuntu, in this case) has as little to do as possible with the VM / container (Alpine, in this case).

Running an Alpine container using Docker on an Ubuntu host will work just fine.

mentat · on Feb 18, 2016

That's for the container package management system, not the host storage system.

sengork · on Feb 19, 2016

I wonder how this would work under shared storage model for container clusters (eg. Mesos). NFS does not cut it for all use cases (eg. DBs).

satbyy · on Feb 18, 2016

For those who missed it, Debian Project Leader Neil McGovern gives details [1] about how licensing issues were resolved so that ZFS can be in Debian now. It is distributed as source-only dkms module.

[1]: http://blog.halon.org.uk/2016/01/on-zfs-in-debian/

cjbprime · on Feb 18, 2016

Though that's not relevant to this news -- Ubuntu isn't using DKMS, they put it directly into their kernel tree, violating GPLv2.

_hyn3 · on Feb 18, 2016

This appears to be inaccurate. They appear to be distributing it as a binary module, outside of the kernel tree, just like many other binary-only kernel modules.

See zfs.ko here: http://blog.dustinkirkland.com/2016/02/zfs-is-fs-for-contain...

"However, there is nothing in either license that prevents distributing it in the form of a binary module or in the form of source code.":

http://zfsonlinux.org/faq.html#WhatAboutTheLicensingIssue

The binary kernel image is a separate issue from the kernel packages (deb); they could include multiple files in a kernel package (deb) that are licensed under different licenses.

awalton · on Feb 18, 2016

Regardless of how the binary is shipped (which could be legal), their aggregation of the source is likely not, since it almost certainly is a derivative work at that point. The fact they have a public git repo where the two codebases touch is probably enough to bait a lawsuit from Oracle, in that they're distributing CDDL code in a way that is against its license.

There's a reason LLNL developed their branch out-of-tree - it's just not worth the legal headaches to aggregate the source like Canonical just did.

hhw · on Feb 18, 2016

The CDDL does not prevent ZFS from being used in Linux. It's the GPL that prevents using CDDL code in the kernel. Oracle doesn't have any grounds to sue based on their IP. They would only be able to sue on behalf of Linux, and violation of the GPL. Although I wouldn't rule that out, it's somewhat less likely to happen.

On the other hand, I suspect RMS isn't too happy with this turn of events. The sfconservancy may be the more likely party to bring a lawsuit. I'm curious to see either of them comment on the situation.

l1ambda · on Feb 18, 2016

The GPL doesn't even really prevent it, the only clause is a "derived work" which is quite a stretch that a court would find a module to be such a thing.

belorn · on Feb 19, 2016

How would you explain to a non-technical person that a kernel module is not a derived work?

Can you remove the Linux kernel and still have a complete and working program? What happen if one removed all function calls to the Linux kernel or use of internal kernel variables? As a module, does it work with any other kernels like windows or apple, and what was the programmers intention when writing it?

There is some arguments in favor of fair use in regard to compatibility, where derived works are infringements but still deemed legal. The courts has historically been rather split on this subject when it comes to software, in particular with several cases voting in favor of unlicensed modules to consoles. It would be quite a big bet either way you vote.

ryao · on Feb 19, 2016

If I take a chapter of a textbook, modify it to be a standalone volume in a collection of books and start distributing it, I am distributing a derived work of the original book, not a derived work of the collection of books. The latter constitutes an aggregation and unless there is some license (superseding doctrine of first sale in the case of books) that prevents it from being redistributed with such things, it is perfectly okay to do that.

Similarly, the original code was taken from OpenSolaris and was adapted for Linux. No matter how we change it, it is a derived work of Solaris. Furthermore, it is distributed as part of a mere aggregation, which is okay with OSS under the OSD and also okay with the GPL under the GPL FAQ. The only time you can claim a combined work is formed is when the module is loaded into a running kernel, but the GPL does not restrict non-distribution and the kernel with the module loaded into it is not being distributed.

As for removing it from the Linux kernel, given that it is an entire storage stack between the block layer and VFS, you would need to replace everything there (including the disk format), but yes, you would have a working system.

As for all calls to Linux kernel symbols, those are provided to LKM so that they can function and they cannot function without it. There are symbols not provided at all, symbols provided only to GPL software and symbols provided to everyone. ZFS only uses the last group, which is intended for use by non-GPL software.

You can design software to load a LKM from an arbitrary. FreeBSD had done that with Windows kernel modules for wireless drivers at one point. Wine does that for certain Windows drivers that do copy protection. There is nothing stopping you from creating a kernel under a difference license that loads modules in the LKM format of a given Linux kernel. Although the usual case is to port the code to another kernel's own LKM implementation. Attorneys with whom I (and apparently Canonical too) have spoken think this is okay.

2trill2spill · on Feb 19, 2016

> Can you remove the Linux kernel and still have a complete and working program? What happen if one removed all function calls to the Linux kernel or use of internal kernel variables? As a module, does it work with any other kernels like windows or apple, and what was the programmers intention when writing it?

ZFS was developed on another operating system, Solaris back in the early 2000's and continues to be actively developed on illumos, FreeBSD, OSX and Linux today. However the bulk of new code seems to come from the illumos and FreeBSD communities. ZFS also runs in userspace to allow for easier testing and development. So if you remove Linux you still have a working program, ie it's a working kernel module for illumos, FreeBSD and Mac OSX, as well as a userspace program.

As for the intentions of Jeff Bonwick and Matt Ahrens, it was to make administration of file systems much easier. The video posted below is about the history of ZFS and is presented by one of the creators. The first person talking is the other founder of ZFS.

Birth of ZFS. https://www.youtube.com/watch?v=dcV2PaMTAJ4

tzs · on Feb 19, 2016

> How would you explain to a non-technical person that a kernel module is not a derived work?

GPLv2 does not use the term "derived work" anywhere. It uses "work [...] derived from the Program", and does not define this term [1].

I'd start out by explaining that before we even get to the question of whether or not the module is a "work [...] derived from the Program", we have to ask the question of whether or not the license even applies. GPLv2 only applies if the module does something that requires permission under copyright law. The copyright law question that needs to be asked is whether or not the module is a "derivative work" of the kernel.

> Can you remove the Linux kernel and still have a complete and working program? What happen if one removed all function calls to the Linux kernel or use of internal kernel variables? As a module, does it work with any other kernels like windows or apple, and what was the programmers intention when writing it?

None of these questions are actually relevant to the copyright law question of whether or not it is a derivative work. They are relevant to the question of whether or not it is useful when not used in conjunction with a Linux kernel but that's not a copyright law question.

To answer the copyright law question of whether or not some program P [2] is a derivative work of some other program Q, you only need to look at the source code to P and Q. If P and Q interact with each other (unilaterally or bilaterally, directly or indirectly) some people get hung up on the mechanism of that interaction, but that's not relevant to the question of whether or not P is a derivative work.

Whether or not a program P that uses function names, function argument ordering, and data structures of program Q, but does not copy algorithmic code from Q, is a derivative work of Q is going to essentially come down to whether or not the interface (I'm including data structures as part of the interface) of Q is copyrightable.

If program interfaces are copyrightable, then programs that interact with other programs will be derivative works of those programs, regardless of whether they interface by static linking, dynamic linking, system calls from a user process P to kernel code Q, IPC from process P to process Q, RPC from process P across a network to process Q on another machine and so on.

If program interfaces are not copyrightable, then as long as all P incorporates from Q are interfaces P won't be a derivative work.

Generally, courts have held that program interfaces are not copyrightable (with the notable exception of the Court of Appeals for the Federal Circuit in the Oracle vs. Google case, which does not set copyright precedent).

Thus we arrive at the major question for kernel modules: what copyrightable kernel elements do they incorporate?

If they just incorporate non-copyrightable interfaces then a kernel module would not be a derivative work of the kernel.

That's not the end of the inquiry though. It would be if some third party were making and distributing the module. E.g., if I were to write a kernel module that does not incorporate any copyrightable kernel elements and distribute it stand alone, for others to download if they want and use it with their kernels, we'd be done.

In the case of a distribution vendor distributing a kernel module along with a kernel, then even though the module itself might not be a derivative work their distribution as a whole is. Questions might arise as to just what constitutes a "work". If they statically link the module to the kernel, the resulting binary is clearly a work, and it is a derivative work of both the kernel and the module, and so the module would have to be GPL. It is important to note in this case that this is because the combined work is a derivative work of the kernel...the module itself is still not a derivative work of the kernel.

How about if the module is dynamically linked, but the configuration they ship automatically loads it at boot time? Might one argue that the kernel, init scripts, and dynamic modules together are all one work that the vendor is distributing?

[1] For completeness, GPLv3 does not use "derive" or "derived" or any similar terms it all. It uses the term "covered work", which is defined as the original program or a "work based on the Program", and it defines that as basically a work that requires copyright permission.

[2] I'm going to use the term "program" expansively to include modules, applications, plug-ins, and so on.

gillianseed · on Feb 18, 2016

The CDDL was crafted with GPL existing and was according to the person responsible for creating it, deliberately made incompatible with GPLv2. Of course not hard to understand given that Sun had no reason whatsoever to hand over their prized technology (ZFS, DTrace) to the competitor which was killing them in the market.

>On the other hand, I suspect RMS isn't too happy with this turn of events.

Why not ?

>The sfconservancy may be the more likely party to bring a lawsuit.

That would require that a Linux copyright holder would want to sue, why would they ? OpenZFS is open source, and previous suits has been about source code compliance.

ryao · on Feb 19, 2016

It is probably more accurate to claim the GPL was designed to be incompatible with an entire class of licenses that includes the CDDL, and the MPL on which it was based and any future licenses similar to or based on licenses in that class (of which the CDDL was given that it was made after the GPL).

There is no clause in the CDDL that places restrictions on other files in a combined work, but there is one the GPL. There are people out there who dislike the GPL for that, there are some people who explicitly go out of their way to avoid GPL compatibility because of that and I am sure that some of those people existed at Sun, but I really doubt that the design of a license by a huge organization with many people giving input can be simplified to one guy thinking GPL incompatibility is a good feature.

I also think this happened years ago and there really is no point to living in the past. People cannot distribute a vmlinux file with ZFS linked into it (i.e. not a kernel module, but part of the binary itself) because of that, but that does not stop people from distributing it as a kernel module and that is how filesystem code is loaded these days, so it is a non-issue.

gillianseed · on Feb 19, 2016

>It is probably more accurate to claim the GPL was designed to be incompatible with an entire class of licenses that includes the CDDL,

It was designed to give and preserve rights for end users, it's not really a big mystery, and the actual rights which are given and preserved perfectly mirror that.

I don't see anything that would substantiate your claim of them being 'deliberately' incompatible with any other licenses (anything you can point to ?), in fact they've fixed incompability problems in GPLv3 with other licenses.

And of course both MPL and CDDL came along much later than GPLv2, with which they were incompatible (MPL 2.0 in turn rectified this).

>can be simplified to one guy thinking GPL incompatibility is a good feature.

No, I don't think for a second that it was 'one guy', again Sun management had absolutely zero reason to allow Linux to incorporate ZFS and DTrace and every business reason not to, in fact from a business standpoint it would have been crazy to hand over ZFS and DTrace to their main competitor.

>but that does not stop people from distributing it as a kernel module and that is how filesystem code is loaded these days, so it is a non-issue.

I'm not at all sure it's a non-issue, this is a Linux kernel module running in Linux kernel space, I'm pretty sure there is a strong case for this being considered a derivative, that said I hope it won't be an issue since having ZFS in a native capacity with minimal effort is a boon for Linux.

ryao · on Feb 19, 2016

Do you have a bar number? If not, you being "pretty sure" does not mean much.

ajsalminen · on Feb 19, 2016

> It is probably more accurate to claim the GPL was designed to be incompatible with an entire class of licenses that includes the CDDL, and the MPL on which it was based and any future licenses similar to or based on licenses in that class (of which the CDDL was given that it was made after the GPL).

Given that work was done to make GPLv3 more compatible with other open source licenses and that GPLv2 predates both of the licenses you mention by quite a bit I'm inclined to think that's nonsense.

ryao · on Feb 20, 2016

If being compatible with anything were the goal, the FSF would have opted for the CC0 license. Since the GPL is not compatible with things on that level, it is designed to be incompatible with certain things. Some subset of possible open source licenses definitely were excluded as part of that.

mrb · on Feb 18, 2016

No, Ubuntu may have put the ZFS source in the kernel tree, but they still ship it to endusers as a separate kernel module and separate Ubuntu package (edit: the separate package is "zfsutils-linux" for the userspace code)

AFAIK to violate the GPL they would have to ship ZFS compiled code in the kernel image, but this is not what they are doing.

mjg59 · on Feb 18, 2016

Whether something is a derivative work or not is incredibly unlikely to be determined by whether it's distributed as a module or linked in.

mentat · on Feb 18, 2016

For at least a decade people have been using that route and have had (at least internal) legal opinions supporting it.

ryao · on Feb 19, 2016

If I take a chapter of a textbook, modify it to be a standalone volume in a collection of books and start distributing it, I am distributing a derived work of the original book, not a derived work of the collection of books. The latter constitutes an aggregation and unless there is some license (superseding doctrine of first sale in the case of books) that prevents it from being redistributed with such things, it is perfectly okay to do that.

Similarly, the original code was taken from OpenSolaris and was adapted for Linux. No matter how we change it, it is a derived work of Solaris. Furthermore, it is distributed as part of a mere aggregation, which is okay with OSS under the OSD and also okay with the GPL under the GPL FAQ. The only time you can claim a combined work is formed is when the module is loaded into a running kernel, but the GPL does not restrict non-distribution and the kernel with the module loaded into it is not being distributed.

You can argue that GPL advocates did not intend to support a license that allows any of this. However, I expect that you would have trouble finding an attorney that will interpret what the copyright holder thought the terms said to supersede the legal meaning of the terms unless explicitly stated.

If you make a license for the kernel that does not allow derived works of other platforms' software to be distributed as ports, you would violate #9 of the OSD and could not call it an open source license:

https://opensource.org/osd-annotated

mjg59 · on Feb 19, 2016

If you take the plot from an episode of Star Trek and modify it such that it fits into the Dr Who storyline, you've created a work that's derivative of both Star Trek and Dr Who. Similarly, if you take code from Solaris and modify it such that it tightly integrates with Linux, you've created a work that's derivative of both Solaris and Linux. Since ZFS can only be distributed under the CDDL and since GPLv2 requires all derived works to be distributed under the GPL, you can't satisfy the license.

ryao · on Feb 19, 2016

> If you take the plot from an episode of Star Trek and modify it such that it fits into the Dr Who storyline, you've created a work that's derivative of both Star Trek and Dr Who. Similarly, if you take code from Solaris and modify it such that it tightly integrates with Linux, you've created a work that's derivative of both Solaris and Linux. Since ZFS can only be distributed under the CDDL and since GPLv2 requires all derived works to be distributed under the GPL, you can't satisfy the license.

That is analogous to writing a new piece of software intended to be similar to an existing piece of software rather than a port of software under license. Examples of the former include the Linux kernel (meant to be similar to UNIX SVR4) and the wine project (meant to be similar to Windows). If that argument is valid:

1. Oracle is in an excellent position to sue every Linux user not using Oracle Linux, because they own rights to UNIX SVR4, which they inherited from Sun.

2. Microsoft is in an excellent position to sue wine users.

3. James Cameron and 20th century Fox would also be in trouble with Disney for Avatar's similarities to Pocahontas.

4. Probably plenty of other bad things.

However, this argument does not apply to ZoL because the code originated in OpenSolaris and is under license and exists as a discrete module, rather than a whole program.

So far, the only thing that you have concretely stated is that you met some attorneys who were unwilling to make a decision on legality. You are not an attorney (unless you have obtained a bar number since I last asked) and I have yet to hear that anyone with a bar number that agrees with you.

If you want to prohibit people from using software you write with things that you consider to be derivatives when the law does not recognize them as such, you need a license that makes that explicit. Such a license could not be called an open source license under clause #9 of the open source definition:

https://opensource.org/osd-annotated

Consequently, the GPL is definitely the wrong license for that.

mjg59 · on Feb 19, 2016

> That is analogous to writing a new piece of software intended to be similar to an existing piece of software rather than a port of software under license.

I take ZFS from Solaris. I rewrite it to work with Linux. In which sense is this not equivalent to my analogy? The examples you're giving are not equivalent, because in each case the work was written without deriving from the other copyrighted work.

> However, this argument does not apply to ZoL because the code originated in OpenSolaris and is under license and exists as a discrete module, rather than a whole program.

That's an entirely arbitrary distinction.

> So far, the only thing that you have concretely stated is that you met some attorneys who were unwilling to make a decision on legality.

No, I said that lawyers had told me that ZoL was an infringing work but that we wouldn't know for sure unless a court decided on it: http://www.phoronix.com/forums/forum/software/general-linux-...

> If you want to prohibit people from using software you write with things that you consider to be derivatives when the law does not recognize them as such

Nobody wants that.

ryao · on Feb 20, 2016

> I take ZFS from Solaris. I rewrite it to work with Linux. In which sense is this not equivalent to my analogy? The examples you're giving are not equivalent, because in each case the work was written without deriving from the other copyrighted work.

I take it that you never actually read the ZFSOnLinux source code.

It is not really rewritten. There is a compatibility layer in place to prevent the need to rewrite much of the code and a very small percentage of the original kernel code actually changed to support Linux, but what did change was meant to use for interfaces that are provided by the kernel to allow proprietary modules to be ported, which suggests any license is fine.

However, the claim that writing a brand new TV show script inspired by another forms a derivative work is to claim that writing things from scratch forms a derivative work.

> That's an entirely arbitrary distinction.

It is the distinction lawyers are making.

> No, I said that lawyers had told me that ZoL was an infringing work but that we wouldn't know for sure unless a court decided on it: http://www.phoronix.com/forums/forum/software/general-linux-....

Do you have bar numbers of these lawyers? Is there any reason to think that they were thinking that zfs.ko somehow used GPL exported symbols or some other thing that is not actually true that does not involve taking your word for it? I did have one person going to law school tell me that it was a derivative work because of that. He did not think he could claim otherwise after an explanation that the code does not do that.

Given that your legal views are so incredibly divorced from those of actual lawyers with whom I have talked, I am not inclined to believe you when you say that they had no misunderstanding, especially when it seems that you have never actually read the code to be able to be sure of that.

> Nobody wants that.

Your claims are inconsistent with that.

mjg59 · on Feb 20, 2016

> It is not really rewritten.

It has several direct calls into Linux functionality that don't go via SPL, but it's also unclear that simply adding an abstraction layer is a meaningful mechanism to avoid derivation.

> what did change was meant to use for interfaces that are provided by the kernel to allow proprietary modules to be ported

There are no such interfaces in Linux.

> the claim that writing a brand new TV show script inspired by another forms a derivative work is to claim that writing things from scratch forms a derivative work.

I didn't make that claim. The analogy in question involves taking an existing work and modifying it such that it includes components of another work.

> It is the distinction lawyers are making.

It's the distinction a lawyer that you've spoken to is making.

> Do you have bar numbers of these lawyers?

Yes.

> Is there any reason to think that they were thinking that zfs.ko somehow used GPL exported symbols or some other thing that is not actually true that does not involve taking your word for it?

No.

> Your claims are inconsistent with that.

My claim is that I have reason to believe that, under copyright law, ZoL is a derivative work of Linux and as such is subject to the terms of the GPL. If the final legal determination is that it's not a derivative work then the GPL is irrelevant.

ryao · on Feb 19, 2016

> If I take a chapter of a textbook, modify it to be a standalone volume in a collection of books and start distributing it, I am distributing a derived work of the original book, not a derived work of the collection of books. The latter constitutes an aggregation and unless there is some license (superseding doctrine of first sale in the case of books) that prevents it from being redistributed with such things, it is perfectly okay to do that.

I should elaborate that you need the original to be under license. Otherwise, you have a problem.

cjbprime · on Feb 18, 2016

> Well, Ubuntu may have put the ZFS source in the kernel tree, but they still ship it to endusers as a separate kernel module and separate Ubuntu package.

No, they aren't using a separate Ubuntu package, it's gone straight into the main kernel repo.

> AFAIK to violate the GPL they would have to ship ZFS compiled code in the kernel image, but this is not what they are doing.

You can violate the GPL inside a kernel module that you distribute.

_hyn3 · on Feb 18, 2016

> No, they aren't using a separate Ubuntu package, it's gone straight into the main kernel repo.

How Ubuntu packages is irrelevant. What matters under the GPL is how the module is linked into the kernel.

> You can violate the GPL inside a kernel module that you distribute.

Of course, but they're not doing that. For example, you could violate the GPL by including GPL'ed code in a kernel module under a more restrictive license.

There's a lot more detail here: http://www.tldp.org/HOWTO/Module-HOWTO/copyright.html

bryanlarsen · on Feb 18, 2016

What matters under the Copyright law (and thus the GPL) is whether the module is a derivative work of the Linux kernel or not.

ZFS was originally created for Solaris, and works on multiple operating systems. So ZFS itself is obviously not a Linux derivative. If the original ZFS could be directly linked with the Linux kernel without modifications, it still wouldn't be a Linux derivative.

But ZFS had to be modified to work with Linux. It can be argued that those modifications are Linux derivatives. We haven't had a definitive ruling on this yet.

ZFS from Solaris / BSD --> not a Linux derivative, even if it was directly linked into Linux.

ZFS with trivial modifications to work with Linux --> not a Linux derivative

ZFS with extensive modifications to work with Linux --> judge's ruling required

The only reason that linking matters is because Linus's statement that binary modules are OK would have some weight with the judge. However, Linus is not the only copyright holder of the Linux kernel, and other copyright holders have disagreed with Linus on this statement.

_hyn3 · on Feb 18, 2016

Using the defined interfaces of the kernel does not constitute a derivative work.

gillianseed · on Feb 19, 2016

It's a Linux kernel module running in Linux kernel address space, I'd say there is reason to assume it can be considered a derivative work, and thus a license incompability.

tzs · on Feb 19, 2016

Do you think that there is reason to assume that every program that ran on MS-DOS on an 8086 was a derivative work of MS-DOS? The programs and MS-DOS all ran in the same address space on the 8086.

lmm · on Feb 18, 2016

Who holds the copyright on the kernel interfaces that the ZFS module uses?

ryao · on Feb 19, 2016

The GPLv2 does not restrict placing things under GPLv2 incompatible in the same tree. It only restricts distribution of binaries that are derivative works under copyright law.

baldfat · on Feb 18, 2016

OpenZFS ! I can't see why this has blown up as a violation when people didn't actually read the announcement.

EDIT :

ZFS is licensed under the Common Development and Distribution License (CDDL), and the Linux kernel is licensed under the GNU General Public License Version 2 (GPLv2). While both are free open source licenses they are restrictive licenses. The combination of them causes problems because it prevents using pieces of code exclusively available under one license with pieces of code exclusively available under the other in the same binary. In the case of the kernel, this prevents us from distributing ZFS as part of the kernel binary. However, there is nothing in either license that prevents distributing it in the form of a binary module or in the form of source code. http://open-zfs.org/wiki/Main_Page

throwaway2048 · on Feb 18, 2016

OpenZFS is CDDL liscensed and GPL incompatable.

jackbravo · on Feb 18, 2016

Here is another post from the parent about this issue specificaly:

- http://blog.dustinkirkland.com/2016/02/zfs-licensing-and-lin...

"We at Canonical have conducted a legal review, including discussion with the industry's leading software freedom legal counsel, of the licenses that apply to the Linux kernel and to ZFS.

And in doing so, we have concluded that we are acting within the rights granted and in compliance with their terms of both of those licenses."

baldfat · on Feb 18, 2016

> However, there is nothing in either license CDDL GPL that prevents distributing it in the form of a binary module or in the form of source code.

http://zfsonlinux.org/faq.html#WhatAboutTheLicensingIss

l1ambda · on Feb 18, 2016

"And zfs.ko, as a self-contained file system module, is clearly not a derivative work of the Linux kernel but rather quite obviously a derivative work of OpenZFS and OpenSolaris. Equivalent exceptions have existed for many years, for various other stand alone, self-contained, non-GPL and even proprietary (hi, nvidia.ko) kernel modules."

mjg59 · on Feb 18, 2016

This would be true if the resulting work were not a derivative work of the GPLed kernel. There's plenty of solid legal opinion that it is, and if you accept that then the GPL absolutely prevents distributing it in the form of a binary module.

TimWolla · on Feb 18, 2016

Sorry, I downvoted you by accident, while focusing my browser window.

ryao · on Feb 20, 2016

Only if you build it into the kernel binary image loaded by the bootloader. Otherwise, it is an independent module.

jpgvm · on Feb 18, 2016

Shame how most of the conversation devolved into licensing rubbish. Almost none of us are qualified to speak on that, leave it to the lawyers - which I assure you Canonical did too.

With that out of the way, ZFS is by far and away the best filesystem for container workloads. Hopefully we will get deeper quota and I/O throttling support soon.

I have been using ZoL in production for many years now thanks mostly to the work of Brian Behelendorf and Richard Yao. So if you find yourselves here thanks for all the work you have put into making ZoL awesome.