My friend works at Shopify and they are 100% all in on AI coding. They let devs spend as much as they want on whatever tool they want. If someone ends up spending a lot of money, they ask them what’s going well and to please share it with others. If you’re not spending, they have a different talk with you.
As for me, we get Cursor seats at work, and at home I have a GPU, a cheap Chinese coding plan, and a dream.
> If someone ends up spending a lot of money, they ask them what’s going well and to please share it with others. If you’re not spending, they have a different talk with you.
Make a "systemctl start tokenspender.service" and share it with the team?
It feels like working with a professional. It just keeps churning until the work is done, and it's actually pretty damn compact with token usage. Definitely the lowest output-tokens-to-value ratio of the frontier models.
Most benchmarks show very little improvement from the "full quality" model over a quantized lower-bit one. You can shrink the model to a fraction of its "full" size and get 92-95% of the same performance, with less VRAM use.
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, VRAM use is heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily cut VRAM by 92-95%, but the reduction is still big (roughly 3-4x going from f16 to q4), especially at smaller contexts.
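If it helps, a rough back-of-envelope (a sketch only; the bits-per-weight values and the layer/head counts are illustrative assumptions, not measurements of any particular model):

    # Rough back-of-envelope VRAM estimate: weights + KV cache.
    # All constants below are illustrative assumptions, not measurements.

    def weights_gb(params_billions: float, bits_per_weight: float) -> float:
        """Memory for the weights alone, in GB."""
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    context_tokens: int, bytes_per_elem: int = 2) -> float:
        """Key+value cache for a single sequence at the given context length."""
        return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

    if __name__ == "__main__":
        # Example: a 70B dense model; q4-style quants are ~4.5 bits/weight in practice.
        for label, bits in [("f16", 16), ("q8", 8.5), ("q4", 4.5)]:
            print(f"70B @ {label}: ~{weights_gb(70, bits):.0f} GB of weights")
        # Hypothetical GQA config: 80 layers, 8 KV heads, head_dim 128, 32k context.
        print(f"KV cache @ 32k ctx: ~{kv_cache_gb(80, 8, 128, 32_768):.1f} GB")

So very roughly: 70B at f16 is ~140 GB of weights, a q4-ish quant of the same model is ~40 GB, and the KV cache comes on top of that.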
Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?
Depending on what your usage requirements are, Mac Minis running UMA over RDMA are becoming a feasible option. At roughly 1/10 of the cost you're getting much, much more than 1/10 the performance. (YMMV)
I did not expect this to be a limiting factor in the Mac Mini RDMA setup!
> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.
Thermal throttling of network cables is a new thing to me…
I admire the patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model, get an answer in 30 seconds, and then use the cloud in the rare cases it's not enough.
70B dense models are way behind SOTA. Even the aforementioned Kimi 2.5 has fewer active parameters than that, and it's quantized to int4 on top. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to the Mac Studio.
> Heck, look at /r/locallama/. There is a reason it's entirely Nvidia.
That's simply not true. Nvidia may be relatively popular, but people use all sorts of hardware there. Just a random couple of recent self-reported hardware setups from the comments:
You have a point that at scale everybody except maybe Google is using Nvidia. But r/locallama is not your evidence of that, unless you apply your priors, filter out all the hardware that doesn't fit your so-called "hypotheticals and 'testing grade'" criteria, and engage in circular logic.
PS: In fact locallama does not even cover your "real world use". Most mentions of Nvidia are people who have older GPUs, e.g. 3090s, lying around, or are looking at the Chinese VRAM mods to allow them to run larger models. Nobody is discussing how to run a cluster of H200s there.
Mmmm, not really. I have both a 4x 3090 box and an M1 Mac with 64 GB. I find that the Mac performs about the same as a 2x 3090. That’s nothing stellar, but you can run 70B models at decent quants with moderate context windows. Definitely useful for a lot of stuff.
You really had to modify the problem to make it seem equal? Not that quants are that bad, but the context window thing is the difference between useful and not useful.
Equal to the 2x3090? Yeah it’s about equal in every way, context windows included.
As for useful at that scale?
I use mine for coding a fair bit, and I don’t find it a detractor overall. It enforces proper API discipline, modularity, and hierarchical abstraction. Perhaps the field of application makes that more important though. (Writing firmware and hardware drivers).
It also brings the advantage of focusing exclusively on the problems that are presented in the limited context, and not wandering off on side quests that it makes up.
I find it works well up to about 1KLOC at a time.
I wouldn’t imply they were equal to commercial models, but I would definitely say that local models are very useful tools.
They are also stable, which is not something I can say for SOTA models. You can learn how to get the best results from a model, and the ground doesn’t move underneath you just when you’re on a roll.
Not at all. I don't even know why someone would be incentivized to promote Nvidia outside of holding large amounts of stock. Though I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To 0 people's surprise, the 2x A6000s did work.
It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.
I always thought that in the case of a rogue AI breakout we could just cut the power or network. This makes both impossible. The sick genius of SkyNet was having the most defensible infrastructure when it became clear that whoever controls the biggest robot army can take out enemy data centers and control the world. Now I hope that shooting down LEO satellites is cheap and DIY-able.
I think it’s all farce and technically unsound, but I also think that grok-5-elononly is a helluva drug. It’s really got him ready to rally investors behind “spreading the light of consciousness to the universe”. Oh to see the chat logs of their (Elon and his machine girlfriend)’s machinations.
I suppose one of the ADRs read something like “…who cares about bitflips, man. Isn’t AI all about probability?”
Knowing the insane level of hardening that goes into putting microcontrollers into space, how do they expect some 3nm-process chip to stand a chance?
A trend at the moment is to just hope for the best in cubesats and other small satellites in LEO. If you’re below the radiation belt it’s apparently tenable. I worked somewhere designing satellite hardware for LEO and we simply opted to use consumer ARM hardware with a special OS with core-level redundancy / consensus to manage bit flips. Obviously some problems will present for AI there… but there are arguably bigger problems with AI data centres, like the fact that they offer almost no benefit with respect to the costs of putting and maintaining stuff in space!
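For what it’s worth, the consensus part is conceptually just majority voting. A minimal illustrative sketch (plain triple modular redundancy, with processes standing in for cores; not the actual OS, which works at a much lower level):

    # Triple modular redundancy sketch: run the same computation on three
    # replicas (processes standing in for cores) and majority-vote the result,
    # so a bit flip in any single replica gets outvoted. A real rad-tolerant
    # system does this much lower down (lockstep cores, memory scrubbing).
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def vote(results):
        """Return the majority value; fail loudly if all replicas disagree."""
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no quorum: all replicas disagree")
        return value

    def run_redundant(fn, *args):
        """Run fn three times on separate processes and vote on the outputs."""
        with ProcessPoolExecutor(max_workers=3) as pool:
            futures = [pool.submit(fn, *args) for _ in range(3)]
            return vote([f.result() for f in futures])

    if __name__ == "__main__":
        import math
        print(run_redundant(math.factorial, 20))  # 2432902008176640000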
I think it could get there with business alone, and also with consumer alone given the hardware, shopping, and ads angles. It’s an everything business and nobody on HN seems to understand that.
In all humility I think I at least loosely embody those qualities. Right now I’m in a comfy F500 remote job that is stable, and it’s been at a time where stability has been important for my family. There will come a time when I’m ready to start or work at a startup. When I do, I want to find a place where my values are valued. I come to work engaged no matter what, but my work is able to be far more impactful when it comes from my self, not only my work avatar.
I’m on HN a lot, and I usually tend to passively browse Who’s Hiring and interesting looking YC ads. Outside of that, I don’t think I would pursue a startup job through job search sites. I would most likely want to find projects I think are neat and start to research and maybe contribute if they have OSS projects, then do individual outreach. I’d probably also start blogging and posting more so people can see if I am a fit for them. Agents may be involved, but only insomuch as I could spend more time doing human stuff like writing, listening, and ideation.
I hope this helps a CTO find a good candidate. I’m personally not on the market right now, but AMA if you want help finding similar folks.
I get it. A stunning indictment of our times… but there is something useful AI could be doing that MS has dropped the ball on: personal finance management. I should be able to have Copilot grab all my transactions, build me budgets, show me what-if scenarios, raise concerns, and help me meet my goals. It should be able to work in Excel where I can see and steer it. The math should be validated with several checks and the output needs to be trustworthy. Ship a free personal finance agent harness and you have your killer app.
I think there are business reasons why they wouldn’t do that, and that makes me sad.
Even a year ago I had success giving Claude a photo of my credit card bill and asking it for repeating category subtotals; it flawlessly OCR'd it and wrote a Python program to do as asked, giving me the output.
I'd imagine if you asked it to do a comparison to something else it'd also write code to do it, and so get it right (and it certainly would if you explicitly asked).
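For reference, the kind of program it wrote was roughly along these lines (a hypothetical reconstruction, assuming the statement has already been parsed into a CSV with 'category' and 'amount' columns; the column names and file name are my assumptions, not what Claude actually produced):

    # Hypothetical reconstruction of the kind of script described above:
    # subtotal credit-card transactions by category.
    import csv
    from collections import defaultdict

    def category_subtotals(path: str) -> dict[str, float]:
        """Sum amounts per category from a CSV with 'category' and 'amount'
        columns (the column names are assumptions for this sketch)."""
        totals: dict[str, float] = defaultdict(float)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["category"].strip()] += float(row["amount"])
        return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

    if __name__ == "__main__":
        for category, total in category_subtotals("statement.csv").items():
            print(f"{category:20s} {total:10.2f}")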