Ask HN: Best Cloud GPU Platform?
9 points by catasaurus on Dec 11, 2023 | 7 comments
Many options exist, but I have been annoyed by how finicky some of them can be. All I need is a well-priced cloud GPU platform with a good command-line utility, like the one fly.io has.

It would be cool if I could just use that tool to, say, send a Python file along with something like a Poetry pyproject.toml to set up the environment, plus info about what type of server I want it to run on (GPU, etc.), and have it run the file on their servers and send the output back to the command-line tool. It should also have cloud storage that the programs can write files to (like model weights).
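
To make the ask concrete, here is a rough sketch of the kind of job description such a tool might send. Everything in it is hypothetical: the `JobRequest` fields, the `to_payload` helper, and the idea of a JSON submit endpoint are illustrations, not any real provider's API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class JobRequest:
    """Hypothetical request a CLI could send to a GPU provider."""
    entrypoint: str               # Python file to run remotely
    pyproject: str                # Poetry pyproject.toml describing the environment
    gpu: str                      # requested accelerator type (illustrative label)
    gpu_count: int = 1            # how many GPUs to attach
    output_dir: str = "outputs/"  # where the job writes files such as model weights


def to_payload(job: JobRequest) -> str:
    """Serialize the request to JSON for a (hypothetical) submit endpoint."""
    return json.dumps(asdict(job), sort_keys=True)
```

A CLI built around this would upload the two files, post the payload, stream stdout back, and sync `output_dir` to cloud storage when the job exits.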



As someone in the process of building a bare metal GPU CSP (cloud service provider), this is interesting and I'm listening.

Thing is though, what you're asking for is really something you can build yourself. Rent a bare metal server full of GPUs and do whatever you want.

You're the first person I've heard ask for something like this. Most people ask for fewer APIs and want to be closer to bare metal. They want more control over things, not less.

Can you explain what you mean by finicky?


Yeah, I understand that it is something that can be done fairly easily, but I am too lazy to do so at the moment (lol). By finicky I mean the usual buggy, hard-to-use web interfaces for setting up some weird Jupyter Notebook that they host, or SSH access (hard to use as in something like AWS SageMaker). I understand the use of these things, but I just want to use my own development environment for everything and then send some code over to run. Preferably the pricing would be serverless-style: paying only for the compute time used to run the programs, not for a reserved server sitting idle.


What you're essentially asking for (especially with the TOML-like configuration) is SLURM + a GPU cluster.

SLURM does that wrapping for you: you essentially just point to the file that you want to run, along with some high-level GPU and CPU resource allocation options, and it schedules and runs it for you.
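
As a rough illustration of that workflow, the sketch below renders a batch script from a small job description. The `#SBATCH` directives are standard SLURM options; the `GpuJob` class, its field names, and the default values are hypothetical and only there for the example.

```python
from dataclasses import dataclass


@dataclass
class GpuJob:
    """Hypothetical job description; field names are illustrative, not part of SLURM."""
    script: str                   # Python file to run on the cluster
    gpus: int = 1                 # number of GPUs to request
    cpus: int = 4                 # CPU cores per task
    time_limit: str = "01:00:00"  # wall-clock limit, HH:MM:SS
    name: str = "train"           # job name shown in the queue


def render_sbatch(job: GpuJob) -> str:
    """Render a batch script using standard #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --gres=gpu:{job.gpus}",
        f"#SBATCH --cpus-per-task={job.cpus}",
        f"#SBATCH --time={job.time_limit}",
        "#SBATCH --output=%x-%j.out",  # %x = job name, %j = job id
        "",
        f"srun python {job.script}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result of `render_sbatch(GpuJob(script="train.py", gpus=2))` to a file and passing that file to `sbatch` on the cluster's login node would queue the job. How GPUs are exposed (for example via `--gres`) depends on how the cluster is configured.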

I have seen some people trying to run GCP (lol) with SLURM, and wouldn't be surprised if it is possible with AWS/Lambda or any of the other cluster service providers (Cluster-as-a-service, CLaaS?).

Just from one Google search, it looks like it's definitely possible with AWS: https://docs.aws.amazon.com/parallelcluster/latest/ug/slurm-...


It sounds like you might want something for GPU batch job management. Some things to check out would be GPU orchestration tools, specifically: Slurm, Run.ai, and SkyPilot.

Or maybe you want a serverless GPU cloud - check out RunPod, Modal, Baseten, and Replicate.

Links:

https://slurm.schedmd.com/documentation.html

https://www.run.ai/ml-workflow-management

https://github.com/skypilot-org/skypilot

https://www.runpod.io/serverless-gpu

https://modal.com/pricing

https://www.baseten.co/pricing/

https://replicate.com/pricing


SLURM is mentioned by @momofuku.

Ray is another good candidate as well (and feels more modern imho).

https://www.ray.io/


Ray looks interesting, will check it out.

EDIT: so the main thing is that technologies like Ray have a way to do these things, but I honestly just want an easy way to do this. Maybe that means I will have to set up something with Ray and AWS myself and make a wrapper for it?


I haven't used Ray, but I've read a bit of its documentation, and from what I gather you install a daemon on the box (Ray Core) and send it commands that it executes. Along the way, you can keep state, store data, and schedule things.

https://docs.ray.io/en/latest/ray-core/key-concepts.html#tas...

https://docs.ray.io/en/latest/ray-core/examples/gentle_walkt...
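
A minimal sketch of that task model, going by the Ray Core docs rather than hands-on experience. `square` and `run_on_ray` are made-up names for the example, and it assumes Ray has been installed with `pip install ray`.

```python
def square(x: int) -> int:
    """Plain Python function; Ray can run it remotely as a task."""
    return x * x


def run_on_ray(values, num_gpus: float = 0):
    """Submit one Ray task per value and collect the results."""
    import ray  # third-party dependency, imported lazily so the helper above has no requirements

    # With no address given, Ray looks for an existing cluster and
    # otherwise starts a local instance on this machine.
    ray.init(ignore_reinit_error=True)

    # Wrap the function as a remote task; num_gpus reserves GPUs per task.
    remote_square = ray.remote(num_gpus=num_gpus)(square)

    futures = [remote_square.remote(v) for v in values]  # returns object refs immediately
    return ray.get(futures)  # blocks until all tasks finish
```

Calling `run_on_ray([1, 2, 3])` should hand back the squares in input order. Setting `num_gpus=1` asks the scheduler to place each task on a node with a free GPU, so on a machine without one the tasks would wait.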

That's what I would want something to do if I were building tooling like this myself. Although I'd do it in Go instead of Python so that the dependency chain is simpler. A single small binary is nicer imho.



