MikeTheGreat's comments

I forget where I saw this (a Medium post, somewhere) but someone summed this up as "I didn't sign up for this just to be a tech priest for the machine god".


Someone commented yesterday that managers and other higher-ups are "already ok with non-deterministic outputs", because that's what engineers give them.

As a manager/tech-lead, I've kind of been a tech priest for some time.


Which is why it's so funny to hear seasoned engineers lament the probabilistic nature of AI systems, and how you have to be hand-writing code to really think about the problem domain.

They all seem to be ICs who forget that there are abstraction layers above them where all of that happens (and more).


To be fair, making a change (particularly changing a habit) takes time. Having something there to remind and nudge you helps make this easier, especially when you're tired, stressed, 'just looking for a short break', etc, etc.

It's like they say: "Your demons will comfort you when no one else will. That's why it's so hard to get rid of them"


(My apologies if this was already asked - this thread is huge and Find-In-Page-ing for variations of "pre-train", "pretrain", and "train" turned up nothing about this. If this was already asked I'd super-appreciate a pointer to the discussion :) )

Genuine question: How is it possible for OpenAI to NOT successfully pre-train a model?

I understand it's very difficult, but they've already successfully done this, and they have a ton of incredibly skilled, knowledgeable, and well-paid employees.

I get that there's some randomness involved but it seems like they should be able to (at a minimum) just re-run the pre-training from 2024, yes?

Maybe the process is more ad-hoc (and less reproducible?) than I'm assuming? Is the newer data causing problems for the process that worked in 2024?

Any thoughts or ideas are appreciated, and apologies again if this was asked already!


> Genuine question: How is it possible for OpenAI to NOT successfully pre-train a model?

The same way everyone else fails at it.

Change some hyperparameters to match the new hardware (more params), maybe implement the latest improvements from papers after validating them in a smaller model run. Start training the big boy, loss looks good, two months and millions of dollars later the loss plateaus, do the whole SFT/RL shebang, run benchmarks.

It's not much better than the previous model, very tiny improvements, oops.


Add to that multiple iterations of having to restart pretraining from some earlier checkpoint when the loss plateaus too early or starts increasing due to some bug…


Isn't that what GPT 4.5 was?


That was a large model that iiuc was too expensive to serve profitably

Many people thought it was an improvement though


I’m not sure what ‘successfully’ means in this context. If it means training a model that is noticeably better than previous models, it’s not hard to see how that is challenging.


Ah. Thanks for posting - this makes a lot of sense.

I can totally see how they're able to pre-train models no problem, but are having trouble with the "noticeably better" part.

Thanks!


OpenAI allegedly has not completed a successful pretraining run since 4o


You don't train the next model by starting with the previous one.

A company's ML researchers are constantly improving model architecture. When it's time to train the next model, the "best" architecture is totally different from the last one. So you have to train from scratch (mostly... you can keep some small stuff like the embeddings).

The implication here is that they screwed up bigly on the model architecture, and the end result was significantly worse than the mid-2024 model, so they didn't deploy it.


I cannot say how the big ML companies do it, but from personal experience of training vision models, you can absolutely reuse the weights of barely related architectures (add more layers, switch between different normalization layers, switch between separable/full convolution, change activation functions, etc.). Even if the shapes of the weights do not match, just do what you have to do to make them fit (repeat or crop). Of course the models will not work right away, but training will go much faster. I usually get over 10 times faster convergence that way.
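
To make the "repeat or crop" idea concrete, here is a rough PyTorch-style sketch. The adapt_weights helper and its details are hypothetical, just an illustration of the approach, not code from any particular project:

    def adapt_weights(old_state_dict, new_model):
        # Copy weights from an old checkpoint into a differently shaped model.
        # Where shapes mismatch, crop or tile the old tensor so it fits; the
        # model won't work right away, but it usually converges much faster
        # than starting from a random init.
        new_state_dict = new_model.state_dict()
        for name, new_param in new_state_dict.items():
            old_param = old_state_dict.get(name)
            if old_param is None or old_param.dim() != new_param.dim():
                continue  # layer is new or restructured; keep its random init
            adapted = old_param
            for dim, (old_n, new_n) in enumerate(zip(old_param.shape, new_param.shape)):
                if old_n > new_n:                   # crop the extra channels/units
                    adapted = adapted.narrow(dim, 0, new_n)
                elif old_n < new_n:                 # repeat, then trim the overshoot
                    reps = [1] * adapted.dim()
                    reps[dim] = -(-new_n // old_n)  # ceiling division
                    adapted = adapted.repeat(*reps).narrow(dim, 0, new_n)
            new_state_dict[name] = adapted
        new_model.load_state_dict(new_state_dict)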


It's possible the model architecture influences the effectiveness of utilizing pretrained weights. I.e., CNNs might be a good fit for this since the first portion is the feature extractor, but you might scrap the decoder and simply retrain that.

Can't say whether the same would work with the Transformer architecture, but I would guess there are some portions that could potentially be reused? (There still exists an encoder/feature-extraction portion.)

If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.
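
For the CNN case described above, the standard "keep the feature extractor, scrap the head" move looks roughly like this (a minimal sketch assuming a recent torchvision; the 10-class head is just a placeholder):

    import torch.nn as nn
    import torchvision

    # Reuse the pretrained backbone, freeze it, and train only a fresh head.
    model = torchvision.models.resnet50(
        weights=torchvision.models.ResNet50_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False                     # keep the feature extractor as-is
    model.fc = nn.Linear(model.fc.in_features, 10)  # new head for a 10-class task
    # Only model.fc.parameters() go into the optimizer now; everything else is frozen.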


Huh - I did not know that, and that makes a lot of sense.

I guess "Start software Vnext off the current version (or something pretty close)" is such a baseline assumption of mine that it didn't occur to me that they'd be basically starting over each time.

Thanks for posting this!


GPT-4.5 was allegedly such a pretrain. It just didn't perform well enough to announce and productize it as such.


It wasn't economical to deploy, but I expect it wasn't wasted; expect the OpenAI team to pick it back up at some point.


The scoop Dylan Patel got was that partway through the GPT-4.5 pretraining run the results were very, very good, but it leveled off and they ended up with a huge base model that really wasn't any better on their evals.


Recently I was talking with someone who speculated that "hot doctors never get accurate heart rate measurements"

It took me a minute to understand (and neither of us think this is 100% true) but it's both funny and a good point.


    and I'm not an LLM (that I can tell so far).
Maybe ask your doctor to administer a Turing test?


I believe the correct test to administer would be the Voight-Kampff test.

https://bladerunner.fandom.com/wiki/Voight-Kampff_test


Genuine question: What does "PEI" mean?

A quick Google search is turning up "Prince Edward Island". Is Prince Edward Island known for being a place with a lot of remote tech workers? (Like, this doesn't _sound_ right, but I know next to nothing about Prince Edward Island :) )


Prince Edward Island sounds like a _great_ place to work from home!

Source: am WFH in a remote farmhouse in Scandinavia - with fibre.


No, PEI does not have a lot of remote workers. AFAIK the main provider of tech jobs there is government in various forms, and they mandate several in-office days per week.

It's just a place that is geographically disconnected from the mainland, and it rhymes.

Source: I work with some people from PEI.


Ok, this is as close as I'm ever gonna get to having a real reason to post this on HN, so here goes:

"Git Gud" by Viva La Dirt League: https://www.youtube.com/watch?v=blSXTZ3Nihs


Praise the sun!


Genuine question: What do you mean by " ask it to implement the plan in small steps"?

One option is to write "Please implement this change in small steps" more-or-less exactly.

Another option is to figure out the steps yourself and then ask it: "Please figure this out in small steps. The first step is to add code to the parser so that it handles the first new XML element I'm interested in; please do this by making change X, and we'll get to Y and Z later."

I'm sure there's other options, too.


My method is that I work together with the LLM to figure out the step-by-step plan.

I give an outline of what I want to do, and give some breadcrumbs for any relevant existing files that are related in some way. I ask it to figure out context for my change and to write up a summary of the full scope of the change we're making, including an index of file paths to all relevant files with a very concise blurb about what each file does/contains, and then also to produce a step-by-step plan at the end. I generally always have to tell it NOT to think about this like a traditional engineering team plan: this is a senior engineer and an LLM code agent working together, so think only about technical architecture. Otherwise you get "phase 1 (1-2 weeks), phase 2 (2-4 weeks), step a (4-8 hours)" sorts of nonsense timelines in your plan.

Then I review the steps myself to make sure they are coherent and make sense, and I poke and prod the LLM to fix anything that seems weird, either fixing context or directions or whatever.

Then I feed the entire document to another clean context window (or two or three) and ask it to "evaluate this plan for cohesiveness and coherency, tell me if it's ready for engineering or if there's anything underspecified or unclear", and I iterate on that 1-3 times until a fresh context window says "This plan looks great, it's well crafted, organized, etc." and doesn't give feedback.

Then I go to a fresh context window, tell it "Review the document @MY_PLAN.md thoroughly and begin implementation of step 1, stop after step 1 before doing step 2", and I start working through the steps with it.


The problem is, by the time you’ve gone through the process of making a granular plan and all that, you’ve lost all productivity gains of using the agent.

As an engineer, especially as you get more experience, you can kind of visualize the plan for a change very quickly and flesh out the next step while implementing the current step

All you have really accomplished with the kind of process described is to make the world's least precise, most verbose programming language.


I'm not sure how much experience you have, and I'm not trying to make assumptions, but I've been working in software for over 15 years. The exact skill you mentioned - being able to visualize the plan for a change quickly - is what makes my LLM usage so powerful, imo.

I can say the right precise wording in my prompt to guide it to a good plan very quickly. As the other commenter mentioned, the entire above process only takes something like 30-120 minutes depending on scope, and then I can generate code in a few minutes that would take 2-6 weeks to write myself, working 8-hour days. Then it takes something like 0.5-1.5 days to work out all the bugs, clean up the weird AI quirks, and maybe have the LLM write some Playwright tests (or whatever framework you use for integration tests) to verify its own work.

So yes, it takes significant time to plan things well for good results, and yes, the results are often sloppy in some parts and have weird quirks that no human engineer would make on purpose. But if you stick to working on prompt/context engineering and getting better and faster at the above process, the key unlock is not that it just does the same coding for you, with it generating the code instead. It's that you can work as a solo developer at the abstraction level of a small startup company.

I can design and implement an enterprise-grade SSO auth system over a weekend that integrates with Okta and passes security testing. I can take a library written in one language and fully re-implement it in another language in a matter of hours. I recently took the native Android and iOS libraries for a fairly large, non-trivial SDK and had Claude build me a React Native wrapper library with native modules that integrates both native libraries and presents a clean, unified interface and TypeScript types to the React Native layer. This took me about two days, plus one more for validation testing. I have never done this before. I have no idea how "Nitro Modules" works, or how to configure a React Native library from scratch. But given the immense scaffolding abilities of LLMs, plus my debugging/hacking skills, I can get to a really confident place really quickly, and I regularly ship production code at work with this process.


It takes maybe 30min and then it can go off and generate code that would take literal weeks for me to write. There are still huge productivity gains being had.


That has not been my experience at all.

It takes 30-40 minutes to generate a plan and it generates code that would have taken 20-30 minutes to write.

When it’s generating “weeks” worth of code, it inevitably goes off the rails and the crap you get goes in the garbage.

This isn’t to say agents don’t have their uses, but i have not seen this specific problem actually work. They’re great for refactoring (usually) and crapping out proof of concepts and debugging specific problems. It’s also great for exploring a new code base where you have little prior knowledge.

It makes sense that it sucks at generating large amounts of code that fit cohesively into the project. The context is too small. My code base is millions of lines of code. My brain has a shitload more of that in context than any of the models. So they have to guess and check and end up incorrect and poor, and I don't. I know which abstractions exist that I can use. It doesn't. Sometimes it guesses right. Oftentimes it doesn't. And once it's wrong, it's fucked for the entire rest of the session, so you just have to start over.


Works for me. Not vanilla Claude Code, though - you need to put some work into generating slash commands and workflows that keep it on task and catch the bad stuff.

Take this for example: https://www.reddit.com/r/ClaudeAI/comments/1m7zlot/how_planm...

This trick is just the basic stuff, but it works really well. You can add on and customize from there. I have a “/task” slash command that will run a full development cycle with agents generating code, many more (12-20) agent critics analyzing the unstaged work, all orchestrated by a planning agent that breaks the complex task into small atomic steps.

The first stage of this project (generating the plan) is interactive. It can then go off and make 10k LOC spread over a dozen commits, and the quality is good enough to ship, most of the time. If it goes off the rails, keep the plan document but nuke the commits and restart. On the Claude MAX plan this costs nothing.

This is how I do all my development now. I spend my time diagnosing agent failures and fixing my workflows, not guiding the agent anymore (other than the initial plan document).

I still review every line of code before pushing changes.


I tell it to generate a todo.md file with hyper-atomic todos, each requiring 20 LOC or less, then have it go through that. If the change is too big, generate phases (5-25) and then do the todos for each phase. That plus some sort of reference docs/high-level plan keeps it going along all right.


What I do is make each step roughly a reviewable commit.

So I'll say something like "Evaluate the URL fetcher library for best practices, security, performance, and test coverage. Write this up in a markdown file. Add a design for single-flighting and a retry policy. Break this down into steps so simple even the dumbest LLM won't get confused."

Then I clear the context window and spawn workers to do the implementation.


What does the :P command do?

/s


    :P :Print
    :[range]P[rint] [count] [flags]
                Just as ":print".  Was apparently added to Vi for
                people that keep the shift key pressed too long...
                This command is not supported in Vim9 script.
                Note: A user command can overrule this command.
                See ex-flags for [flags].


Searching for "vim game" this is the only thing I found:

Vscode Vim Academy

https://marketplace.visualstudio.com/items?itemName=kaisun.v...

Does that look like what you used?


OK, actually, upon closer inspection, the extension was not a game, just a "learn vim" type of thing. I think I got it mixed up with a vim game I found online.

https://marketplace.visualstudio.com/items?itemName=vinthara...

