pertymcpert's comments | Hacker News

$50k a year? Those are rookie numbers. You're actually fine, as a small fish going belly up isn't the end of the world. You can start a new business. For some big tech companies this is potentially near existential. I would know.

The problem with modern tech people is their obsession with having something to point to in public. People can't just be writing software anymore.

I believe LLMs (specifically code gen) have produced nothing of substance. I'm looking for evidence to disprove that assumption. You're welcome to share nothing, but when you brag about how fantastic it is, it's reasonable to ask. And then no one can ever prove it... I can only hear that as: I could if I wanted to, I just don't want to.

If you don't want to field questions about it, don't brag about it?

Equally, to mirror your condemnation: the problem with AI enjoyers is that they claim it's nearly perfect, it can do everything, and it makes them so much faster. But every example is barely more than boilerplate, or it's a sham.


Indeed, all this praise gives a very "I do have a girlfriend. You don’t know her. She’s from Canada." feel.

It's very hard for any LLM fans to share anything substantial; it's always just demos and prototypes that even they can't explain. If you work at any big company, you know the LLM fans who are desperately trying to show their managers and leaders that they can use LLMs. Even the publicly famous programmers who run 10 agents at the same time: when you use their products, you see they've become buggy and they've been shipping more slop than their customers asked for.

None of these open-source models can actually compete with Sonnet when it comes to real-life usage. They're all benchmaxxed, so in reality they're not "nipping at the heels". Which is a shame.


M2.1 comes close. I'm using it now instead of Sonnet for real work every day, since the price drop is much bigger than the quality drop. And the quality isn't that far off anyway; they're likely one update away from being genuinely better. Also, if you're not in a rush, just letting it run in OpenCode for a few extra minutes to solve any remaining issues will only cost you a couple of cents, and it will likely reach the same end result as Sonnet. That's especially nice on really large tasks like "document everything about feature X in this large codebase, write the docs, now create an independent app that just does X" that can take a very long time.


I agree. I use Opus 4.5 daily and I'm often trying new models to see how they compare. I didn't think GLM 4.7 was very good, but MiniMax 2.1 is the closest to Sonnet 4.5 I've used. Still not at the same level, and still very much behind Opus, but it is impressive nonetheless.

FYI I use CC for Anthropic models and OpenCode for everything else.


M2.1 is extremely bad at writing tests and at following instructions from a .md, I've found.


It’s a shame but it’s also understandable that they cannot compete with SOTA models like Sonnet and Opus.

They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.


You can let them play complete-information games (1 or 2 player) with randomly created rulesets. It's very objective, but the thing is that anything can be optimized for. This benchmark would favor models that are good at logic puzzles / chess-style games, possibly at the expense of other capabilities.
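
A rough sketch of how such a head-to-head benchmark could work, in Python. The ask_model() hook is hypothetical (you'd wire it up to each model's actual API); here it's stubbed with a random legal move so the harness runs on its own. Each game uses a freshly randomized Nim-style ruleset, so there is no fixed answer key to memorize:

    import random

    def random_ruleset(seed):
        """Generate a random Nim-like ruleset: a starting heap and a set of legal removals."""
        rng = random.Random(seed)
        heap = rng.randint(15, 40)
        moves = sorted(rng.sample(range(1, 8), k=3))  # e.g. legal removals {1, 3, 6}
        return heap, moves

    def ask_model(model, heap, moves):
        """Hypothetical hook: prompt `model` with the game state and parse its chosen move.
        Stubbed here with a random legal move; replace with a real API call."""
        legal = [m for m in moves if m <= heap]
        return random.choice(legal)

    def play_game(model_a, model_b, seed):
        """One game: players alternate removals; whoever takes the last object wins."""
        heap, moves = random_ruleset(seed)
        players = [model_a, model_b]
        turn = 0
        while True:
            if not any(m <= heap for m in moves):
                return players[1 - turn]  # no legal move: current player loses
            heap -= ask_model(players[turn], heap, moves)
            if heap == 0:
                return players[turn]
            turn = 1 - turn

    def head_to_head(model_a, model_b, games=100):
        """Score two models against each other across freshly randomized rulesets."""
        wins = {model_a: 0, model_b: 0}
        for seed in range(games):
            wins[play_game(model_a, model_b, seed)] += 1
        return wins

    print(head_to_head("model-a", "model-b"))

Win rate against a fixed opponent is hard to argue with, but the caveat above still holds: a lab could train specifically for game-playing.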


swe-rebench is a pretty good indicator. They take "new" tasks every month and test the models on those. For the open models it's a good measure of task performance, since the tasks are collected after the models were released. It's a bit trickier for evaluating API-based models, but it's the best concept yet.


That's lmarena.


I know exactly what they're talking about and I lived in another city in the UK (not close to London). It's a thing.


My original link was to a comment in this thread, which is what I quoted that from. The link has now been changed to the main thread.


Just because you don't see it or refuse to believe people doesn't make you right and them liars. Maybe you're just wrong.


Or maybe I’m just right and you’re just slow at seeing what other people can see.

I’m not a SWE either, FYI. Therefore I have no vested interest.


There are going to be lots of fuck-ups, but with frontier models improving so much, there are also going to be lots of great things made. Horrible, soul-crushing technical debt will get addressed because it can be offloaded to models rather than spending a person's thought and sanity on it.

I think overall for engineering this is going to be a net positive.


Why did you jump to the assumption that this:

> The new normal isn't like that. Rewriting an existing cleanly implemented Vanilla JavaScript project (with tests) in React is the kind of rote task you can throw at a coding agent like Claude Code and come back the next morning and expect most (and occasionally all) of the work to be done.

... meant that person would do it in a clandestine fashion rather than it being an agreed-upon task beforehand? Is this how you operate?


My very first sentence:

> And everyone else's work has to be completely put on hold

On a big enough team, getting everyone to a stopping point where they can wait for you to do your big-bang refactor of the entire code base, even if it is only a day later, is still really disruptive.

The last time I went through something like this, we did it really carefully, migrating one page at a time from a multi-page application to a SPA. Even that required ensuring that whichever page was being transitioned didn't have other people working on it, let alone the whole code base.

Again, I simply don't buy that you're going to be able to AI your way through such a radical transition on anything other than a trivial application with a small or tiny team.


> meant that person would do it in a clandestine fashion rather than it being an agreed-upon task beforehand? Is this how you operate?

It doesn't mean that at all.

In an AI-heavy project it's not unusual to have many speculative refactors kicked off, and then you come back to see what they look like.

Wondering whether you can do a Rust SIMD-optimized version of that NumPy code you have? Try it! You don't even need to waste review time on it, because you have heavy test coverage and can see if it is worth looking at.
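
To make the "tests decide" part concrete, here is a minimal sketch of such a gate in Python. The fastmath_rs module and its norm() function are made up for illustration; in practice it would be whatever the speculative Rust rewrite exposes:

    import numpy as np
    import pytest

    # Hypothetical Rust extension produced by the speculative rewrite; skip
    # the whole check cleanly if the experiment was never built.
    fastmath_rs = pytest.importorskip("fastmath_rs")

    def reference_norm(x: np.ndarray) -> np.ndarray:
        """The existing, trusted NumPy code path."""
        return (x - x.mean()) / x.std()

    def test_rust_rewrite_matches_numpy():
        rng = np.random.default_rng(0)
        x = rng.normal(size=100_000)
        got = np.asarray(fastmath_rs.norm(x))  # hypothetical SIMD version
        np.testing.assert_allclose(got, reference_norm(x), rtol=1e-6, atol=1e-9)

If that passes and the benchmark numbers look good, the rewrite earns a human review; if not, you throw it away without spending anyone's time.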


If you have hundreds of devs working on the project, it's not possible to do a full rewrite in one go. So it's not about being clandestine, but rather that there's just no way to get it done regardless of how many AI superpowers you bring to bear.


To repeat my other comment:

> Opus 4.5 is categorically a much better model, from benchmarks and personal experience, than Opus 4.1 and the Sonnet models. The reason you're seeing a lot of people wax lyrical about O4.5 is that it was a real step change in reliable performance. It crossed a critical threshold for me in being able to solve problems by approaching things in systematic ways.


Opus 4.5 is categorically a much better model, from benchmarks and personal experience, than Opus 4.1 and the Sonnet models. The reason you're seeing a lot of people wax lyrical about O4.5 is that it was a real step change in reliable performance. It crossed a critical threshold for me in being able to solve problems by approaching things in systematic ways.

Why do you use the word "chasing" to describe this? I don't understand. Maybe you should try it and compare it to earlier models to see what people mean.


> Why do you use the word "chasing" to describe this?

I think you'll get the answer to this if you read my comment and your response to it, and consider why you didn't address mine.

Btw, I have tried it. It's annoying that people assume the problem is that we haven't tried. That argument was getting old when GPT 3.5 came out. Let's update it...

