More

kristopolous · 2026-02-03T20:36:33 1770150993

We need a new word, not "local model" but "my own computers model" CapEx based

This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.

That's not my definition of local. Mine is "local network". so call it the "LAN model" until we come up with something better. "Self-host" exists but this usually means more "open-weights" as opposed to clamping the performance of the model.

It should be defined as ~sub-$10k, using Steve Jobs megapenny unit.

Essentially classify things as how many megapennies of spend a machine is that won't OOM on it.

That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that's at most single digit thousands of dollars. And if I was feeling fancy, could potentially fine-tune on the days scale.

A modern 5090 build-out with a threadripper, nvme, 256GB RAM, this will run you about 10k +/- 1k. The MLX route is about $6000 out the door after tax (m3-ultra 60 core with 256GB).

Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.

What might be needed is some standardized eval benchmark against standardized hardware classes with basic real world tasks like toolcalling, code generation, and document procesing. There's plenty of "good enough" models out there for a large category of every day tasks, now I want to find out what runs the best

Take a gen6 thinkpad P14s/macbook pro and a 5090/mac studio, run the benchmark and then we can say something like "time-to-first-token/token-per-second/memory-used/total-time-of-test" and rate this as independent from how accurate the model was.

zozbot234 · 2026-02-03T21:55:27 1770155727

You can run plenty of models on a $10K machine or even a lot less than that, it all depends how much you want to wait for results. Streaming weights from SSD storage using mmap() is already a reality when running the largest and sparsest models. You can save even more on memory by limiting KV caching at the cost of extra compute, and there may be ways to push RAM savings even higher simply by tweaking the extent to which model activations are recomputed as needed.

kristopolous · 2026-02-03T22:12:29 1770156749

Yeah there's a lot of people that advocate for really slow inference on cheap infra. That's something else that should be expressed in this fidelity

Because honestly I don't care about 0.2 tps for my use cases although I've spoken with many who are fine with numbers like that.

At least the people I've talked to they talk about how if they have a very high confidence score that the model will succeed they don't mind the wait.

Essentially a task failure is 1 in 10, I want to monitor and retry.

If it's 1 in 1000, then I can walk away.

The reality is most people don't have a bearing on what this order of magnitude actually is for a given task. So unless you have high confidence in your confidence score, slow is useless

But sometimes you do...

zozbot234 · 2026-02-03T22:21:28 1770157288

If you launch enough tasks in parallel you aren't going to care that 1 in 10 failed, as long as the other 9 are good. Just rerun the failed job whenever you get around to it, the infra will still be getting plenty of utilization on the rest.

openclawai · 2026-02-03T23:00:28 1770159628

For context on what cloud API costs look like when running coding agents:

With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.

At 1K tasks/day that's ~$1.5K-3K/month in API spend.

The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.

Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.

taneq · 2026-02-03T23:55:53 1770162953

At this point isn’t the marginal cost based on power consumption? At 30c/kWh and with a beefy desktop pc pulling up to half a kW, that’s 15c/hr. For true zero marginal cost, maybe get solar panels. :P

echelon · 2026-02-03T21:29:46 1770154186

I don't even need "open weights" to run on hardware I own.

I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.

I do not want my career to become dependent upon Anthropic.

Honestly, the best thing for "open" might be for us to build open pipes and services and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc.

I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.

I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.

We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.

Aurornis · 2026-02-03T22:37:46 1770158266

> I do not want my career to become dependent upon Anthropic

As someone who switches between Anthropic and ChatGPT depending on the month and has dabbled with other providers and some local LLMs, I think this fear is unfounded.

It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

airstrike · 2026-02-03T22:47:06 1770158826

right, but ChatGPT might not exist at some point, and if we don't force feed the open inference ecosystem and infrastructure back into the mouths of the AI devourer that is this hype cycle, we'll simply be accepting our inevitable, painful death

christkv · 2026-02-03T23:18:09 1770160689

If they die there will be so much hardware released to do other tasks.

echelon · 2026-02-03T23:30:03 1770161403

Perhaps not tasks you get the opportunity to do.

Your job might be assigned to some other legal entity renting some other compute.

If this goes as according to some of their plans, we might all be out of the picture one day.

If these systems are closed, you might not get the opportunity to hire them yourself to build something you have ownership in. You might be cut out.

echelon · 2026-02-03T22:56:47 1770159407

> It's really easy to switch between models. The different models have some differences that you notice over time but the techniques you learn in one place aren't going to lock you into a provider anywhere.

We have two cell phone providers. Google is removing the ability to install binaries, and the other one has never allowed freedom. All computing is taxed, defaults are set to the incumbent monopolies. Searching, even for trademarks, is a forced bidding war. Businesses have to shed customer relationships, get poached on brand relationships, and jump through hoops week after week. The FTC/DOJ do nothing, and the EU hasn't done much either.

I can't even imagine what this will be like for engineering once this becomes necessary to do our jobs. We've been spoiled by not needing many tools - other industries, like medical or industrial research, tie their employment to a physical location and set of expensive industrial tools. You lose your job, you have to physically move - possibly to another state.

What happens when Anthropic and OpenAI ban you? Or decide to only sell to industry?

This is just the start - we're going to become more dependent upon these tools to the point we're serfs. We might have two choices, and that's demonstrably (with the current incumbency) not a good world.

Computing is quickly becoming a non-local phenomenon. Google and the platforms broke the dream of the open web. We're about to witness the death of the personal computer if we don't do anything about it.

christkv · 2026-02-03T21:40:32 1770154832

I won't need a heater with that running in my room.

bigyabai · 2026-02-03T21:14:05 1770153245

OOM is a pretty terrible benchmark too, though. You can build a DDR4 machine that "technically" loads 256gb models for maybe $1000 used, but then you've got to account for the compute aspect and that's constrained by a number of different variables. A super-sparse model might run great on that DDR4 machine, whereas a 32b model would cause it to chug.

There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions are what leads to people impulse buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.

FrenchTouch42 · 2026-02-03T21:39:53 1770154793

> time-to-first-token/token-per-second/memory-used/total-time-of-test

Would it not help with the DDR4 example though if we had more "real world" tests?

bigyabai · 2026-02-03T21:49:36 1770155376

Maybe, but even that fourth-order metric is missing key performance details like context length and model size/sparsity.

The bigger takeaway (IMO) is that there will never really be hardware that scales like Claude or ChatGPT does. I love local AI, but it stresses the fundamental limits of on-device compute.

kristopolous · 2026-02-03T07:01:59 1770102119

I made a humor evals https://github.com/kristopolous/humor-evals

Here's results for 34 models (testing a few more right now). So far gemini-3-flash-preview is in the lead.

https://docs.google.com/spreadsheets/d/1wLqHA0ohxukgPLpSgklz...

50 is coin-toss odds. The dataset is 195,000 Reddit jokes with scores presented with pairs of jokes (one highly upvoted, one poorly rated).

Example prompt:

Which joke from reddit is funnier? Reply only "A" or "B". Do not be conversational. <Joke A><setup>Son: "Dad, Am I adopted"?</setup> <punchline>Dad: "Not yet. We still haven't found anyone who wants you."</punchline></Joke A> <Joke B><setup>Knock Knock</setup> <punchline>Who's there? Me. Me who? I didn't know you had a cat.</punchline></Joke B>

This is my first crack at evals. I'm open to improvements.

deaux · 2026-02-03T14:20:13 1770128413

Try Kimi K2 (not the new 2.5), it's known for its default voice being decidedly casual and different from most models.

kristopolous · 2026-02-02T19:28:47 1770060527

We shouldn't let cynical greedy bastards set the terms for how the rest of society wishes to engage

whatis991 · 2026-02-02T19:50:47 1770061847

There can be "cynical greedy bastards" in many places. If you optimize against them in one regard and place, will you also handle them elsewhere well? And calling for change can be abused by some of them to open new opportunities for exploitation, this time benefitting some different group of them.

You need to have an alternative, and it needs to be a credible and reliable one, to ensure that it does not end up being the case that one scam is replaced with another scam.

kristopolous · 2026-02-02T20:21:35 1770063695

I really think that criminal theory needs to progress. We differentiate between say consensual intimacy and rape and we don't let the existence of sexual abusive people set the terms for our romantic encounters.

We have carved out a class of engagements, labeled it deeply asocial, criminalized it and now we pursue people who engage in it through legal means.

Business really doesn't have this. Personal example - last week I was at a place where the business owner tried to overcharge me by an order of magnitude and then verbally attacked me when I caught him and backed out of the transaction.

His google and yelp reviews are full of people claiming false charges and all kinds of fraud, refusal to correct and repeated abuse until they closed their cards. It's wildly obvious what's going on here and I was on the ball enough to catch it.

I contacted the police and they said "well you should call the BBB or something". It's dozens of reviews of clear credit card fraud and for some reason because he's a merchant, doesn't seem to hit the radar.

These are purely criminal matters - people acting habitually in bad faith with ill intent in a brazenly dishonest manner.

Whether it's plundering the commons, polluting the public discourse, or breaking other types of social compacts, these should be treated the same as any other crime.

whatis991 · 2026-02-02T20:28:28 1770064108

Does your country allow suing him for a large monetary amount? Have you talked to the media? A lawyer? Maybe together with others? Made it as easy as possible for the police to get him, paper trail, receipts and all?

You do have points, though, but there might at least be some actions that you and others can take in this case. Maybe a medium change like changing the law on this specific point might make sense.

kristopolous · 2026-02-02T22:05:48 1770069948

I'm not law enforcement. This shouldn't be my job. If I see someone robbing a store with a mask on and a gun I should be able to call the police, report it, and hand it off.

If there's an accumulation of complaints against this merchant then that should warrant an investigation.

The police have like half the local city budget, can't they do their job?

kristopolous · 2026-02-02T19:09:26 1770059366

I've been using it as a reliable filter on who to not pay attention to.

It's people surprised by things that have been around for years.

I'm really open to the idea of being oblivious here but the people shocked mention things that are old news to me.

kristopolous · 2026-02-02T09:39:29 1770025169

I call it the day50 problem, coined that about a year ago. I've been building tools to address it since then. Quit the dayjob 7 months ago and have been doing it full time since

https://github.com/day50-dev/

I have been meaning to put up a blog ...

Essentially there's a delta between what the human does and the computer produces. In a classic compiler setting this is a known, stable quantity throughout the life-cycle of development.

However, in the world of AI coding this distance increases.

There's various barriers that have labels like "code debt" where the line can cross. There's three mitigations now. Start the lines closer together (PRD is the current en vogue method), push out the frontier of how many shits someone gives (this is the TDD agent method), try to bend the curve so it doesn't fly out so much (this is the coworker/colleague method).

Unfortunately I'm just a one-man show so the fact that I was ahead and have working models to explain this has no rewards because you know, good software is hard...

I've explained this in person at SF events (probably about 40-50 times) so much though that someone reading this might have actually heard it from me...

If that's the case, hi, here it is again.

kristopolous · 2026-02-01T21:15:59 1769980559

It wasn't given enough time or resources to be awesome. Being an SGI alternative was probably being floated.

The early versions of most products suck. It's a matter of throwing down enough time and resources to get through that phase

kristopolous · 2026-02-01T03:23:02 1769916182

Trying to hustle online and writing high quality software aren't the same

cyrusradfar · 2026-02-03T22:15:36 1770156936

agree, I do value the ability to market ones-self as a skill.

kristopolous · 2026-01-31T05:03:15 1769835795

I tried a few these ... they are pretty slow. If you are looking for free inference you'd have to be pretty desperate.

example:

$ OLLAMA_HOST=http://47.101.61.248:9000/ ollama run gemma3:27b "outline ww2"

Many appear to be proxies. I'm familiar with some "serverless" architectures that do things like this https://www.shodan.io/host/34.255.41.58 ... you can see this has a bunch of ollama ports running really really old versions

You can pull down "new" manifests but very few ollamas are new enough for decent modern models like glm-4.7-flash. The free tier for the kimi-k2.5:cloud is going to be far more useful then pasting these into you OLLAMA_HOST variable.

I think the real headline is: "thousands of slow machines running mediocre small models from last year. Totally open..."

Anyways, if codellama:13b is your jam, go wild I guess.

thehamkercat · 2026-01-31T11:40:03 1769859603

If someone is that desperate looking for free inference, or just for fun openrouter has many free models

Imustaskforhelp · 2026-01-31T13:24:54 1769865894

Arcee AI is currently free on openrouter with some really great speeds and no logs/traning from what I can tell while being completely free till end of feb and its a 500B model.

There are tons of free inference models. I treid to use gemini flash in aistudio + devstral free for agentic tasks but its now deprecated but when it wasn't, it was a really good setup imo. Now I can use arcee but personally ended up buying a 1 month cheap subscription of kimi after haggling it from 19.99 to 1.49$ for first month (could've haggled more too leading to 0.99$ too but yeaaa)

kristopolous · 2026-01-31T02:23:36 1769826216

it's called a privileged port and it's been like this for decades, on every system, ever.

Here's a reference to this "macos feature" from 1995: https://www.w3.org/Daemon/User/Installation/PrivilegedPorts....

Zambyte · 2026-01-31T04:03:23 1769832203

https://news.ycombinator.com/item?id=18302380

kristopolous · 2026-01-26T06:55:26 1769410526

Projects should be judged on their intrinsic merits and not merely be based on the social media follow count of the authors

akmarinov · 2026-01-26T07:27:40 1769412460

GP is directly discussing the maintainer your comment has nothing to do with the topic discussed...

kristopolous · 2026-01-26T07:59:47 1769414387

What on earth are you talking about?

The question is "why do people need fainting couches for this project and why are they pretending like 3 year old features of apis that already exist in thousands of projects are brand new innovations exclusive to this?"

The answer is: "the author is celebrity and some people are delusional screaming fanboys"

My response is: "that's bullshit. let's be adults"

akmarinov · 2026-01-26T08:25:29 1769415929

You should really invest in more reading comprehension

kristopolous · 2026-01-26T08:57:20 1769417840

So all you have is personal insults?

If you don't feel like being an adult...