The Apache License you released Lightning under explicitly allows people to clone your framework if they adhere to the license and provides them patent protection if they do so.
As I said in my previous comment, using patents to try and get around an open source license is skeevy as hell.
Just that if you use Lightning you'll have zero friction. Whereas with the others... you might run into issues inherent in the other frameworks' hard-to-work-with designs.
Hi William -- we have absolutely not copied any of Lightning's APIs.
In fact, our PyTorch API makes some significantly different design choices than Lightning does -- e.g., we require users to step optimizers and run the backward pass explicitly, which is a bit lower-level but allows for more flexibility when using the API.
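For anyone unfamiliar with the distinction, here is a minimal, hypothetical sketch in plain PyTorch (toy model and data; not the actual API of either framework) of what an explicit backward pass and optimizer step look like. Lightning, by contrast, calls backward() and optimizer.step() for you once training_step returns a loss.

```python
# Minimal, hypothetical sketch of an "explicit" training loop in plain PyTorch
# (toy model and data; not either framework's actual API): the user is
# responsible for zero_grad(), backward(), and stepping the optimizer.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(32, 10)             # toy batch
targets = torch.randint(0, 2, (32,))

optimizer.zero_grad()
loss = F.cross_entropy(model(inputs), targets)
loss.backward()       # the backward pass is explicit...
optimizer.step()      # ...and so is the optimizer step
```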
Projects copying from each other is exactly in the open source spirit. Don't release something under the Apache License if you don't want people copying it under compatible licenses. Also, patenting open source software (to limit uses that the license would otherwise allow) and asserting API copyrights are pretty strongly frowned upon in the open source community. As a note, the Apache License you released Lightning under grants patent-lawsuit protection to anyone using your code under the license, so claiming copyright and patent infringement against another Apache-licensed project seems amazingly skeevy.
If this is the philosophical stance that Grid and Lightning are taking, then it's definitely a project I'm going to advise people to stay well clear of. It's the worst flavor of commercialized open source software and potentially a legal liability to touch in any way, as you seem way too trigger-happy with lawsuits.
Can you point out where exactly in those docs you highlight the issue?
I just read the linked page and found no references to data loading limitations or performance limitations. Is it only in the video, which isn't search-indexed and which few people would bother watching?
edit: The page literally advertises the speed of TPUs with "In general, a single TPU is about as fast as 5 V100 GPUs!" which is the exact opposite of warning people.
No, you do not support the TPU infeed, and this is a crucial distinction. Saying that you do support this has caused endless confusion and much surprise. It’s almost not an exaggeration to say that you’re lying (sorry for phrasing this so bluntly, but I’ve seriously spent dozens of hours trying to break this misconception due to hype like this).
TPU support is real. Pytorch does in fact run on TPUs. But you don’t support TPU CPU memory, the staging area that you’re supposed to fill with training data. That staging area is why a TPU v3-512 pod can train an imagenet resnet classifier in 3 minutes at around 1M examples per second.
You will not get anywhere near that performance with pytorch on TPUs. In fact, you’re expected to create a separate VM for every 8 TPU cores. The VMs are in charge of feeding the cores. That’s insane; I’ve driven TPU pods from a single n1-standard-2 using tensorflow.
Repeat after me: if you are required to create more than one VM, you do not (yet!) support TPU pods. I wish I could triple underline this and put it in bold. People need to understand the limitations of this technique. Creating 256 VMs to feed a v3-2048 is not sustainable.
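To make the limitation concrete, here is a rough, toy sketch of the one-VM-per-8-cores PyTorch/XLA pattern being described (random tensors stand in for a real dataset): each host VM runs a script like this, spawns one process per local TPU core, and its own CPUs do all the data loading, which is why a v3-2048 pod ends up needing 256 such VMs just to keep the cores fed.

```python
# Rough sketch of the one-VM-per-8-cores PyTorch/XLA pattern; every host VM
# in a pod runs this script. Toy model and random data stand in for real work.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()                        # one TPU core per process
    model = torch.nn.Linear(2048, 1000).to(device)  # toy stand-in for a real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # This VM's CPUs do all the decoding/batching and push batches to its
    # 8 local cores; there is no pod-wide infeed pre-filling TPU host memory.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(1024, 2048), torch.randint(0, 1000, (1024,)))
    loader = pl.MpDeviceLoader(
        torch.utils.data.DataLoader(dataset, batch_size=128), device)
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        xm.optimizer_step(optimizer)                # step + cross-core sync

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)  # 8 processes = the 8 TPU cores attached to this VM
```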
Like I said... the PyTorch and TensorFlow teams are working very hard to make this work. And yes, it's not 1:1 with TensorFlow, but we're making progress very aggressively.
I love what you guys are doing, and I love improving the ML ecosystem, but you've gotta understand, people see this and think “oh, ok, it’s a small difference, no big deal.” In fact it’s a huge difference.
Picture a person with one arm and without legs. Would you say they aren’t “1:1 in terms of features”? They certainly won’t be winning any races.
And unlike real people, you can’t graft on a prosthetic limb to help this situation. The issue I’m describing here is a fundamental one that everyone keeps trying to sweep under the rug and pretend isn’t an issue. And then everyone wonders what’s going on.
I 100% agree. We don't want to misrepresent TPU support. In fact, we explicitly warn users in our docs. Open to suggestions about how we can communicate this much better to our users.
We just want to be part of the effort to bridge the big gaps and barriers that keep users from adopting TPUs.
There's a difference between "supporting TPUs" and "supporting TPUs at 100% potential". Although the distinction is important, I don't think the marketing here is misleading.
Not only is it misleading, it even somehow tricked you. :)
We’re not talking about a small 10% reduction in performance here. We’re talking like 40x differences.
If it seems unbelievable, and like it can’t possibly be true, well: now you understand my frustration here, and why I’m trying to break the myth.
Notice not a single benchmark has ever gone head to head in MLPerf using pytorch on TPUs. And that’s because using pytorch on TPUs requires you to feed each image manually to the TPU on demand, from your VM. Meaning the TPU is always infeed bound.
Engineers should be wincing at the sound of that. Especially anyone with graphics experience. Being infeed bound means you have lots of horsepower sitting around doing nothing. And that’s exactly the situation you’ll end up in with this technique.
There’s a way to settle this decisively: train a resnet classifier on imagenet, as quickly as possible. If you get anywhere near the MLPerf v0.6 benchmarks for tensorflow on TPUs, I will instantly pivot the other direction and sing the praises of pytorch on TPUs far and wide.
- persistent storage (setup env and data persists across restarts)
- free SSH, and connect your local IDE
- CPU setup, GPU run (do setup work on CPUs and switch to run on a GPU when ready)
- no credit card
- pay-as-you-go if you need more GPU hours
- A100s, H100s, and more GPUs available