I reckon this opinion is more prevalent than the hyped blog posts and news stories suggest; I've been asking colleagues this exact question and most share the sentiment, myself included, albeit not quite so pessimistically.
Most people I've seen espousing LLMs and agentic workflows as a silver bullet have limited experience with the frameworks and languages they use with these workflows.
My view currently is one of cautious optimism: that LLM workflows will reach a more stable point where they ARE close to what the hype suggests. For now, I think the quote "LLMs raise the floor, not the ceiling" is very apt.
This is a really cynical take. People work differently and get value from different things. It’s probably safe to assume most aren’t virtue signalling about writing.
Great point about the conflation. This makes me realise: for me, writing code is often a big part of thinking through the problem. So it’s no wonder that I’ve found LLMs to be least effective when I cede control before having written a little code myself, ie having worked through the problem a bit.
I’m working with doctors at the moment in a similar area. eGFR is well-known to decline at approx 1 point per year after age 30. You’re fine.
Here’s just one source:
“After the age of 30 years, glomerular filtration rate (GFR) progressively declines at an average rate of 8 mL/min/1.73 m² per decade.”
Right, maybe my definition of overfitting was wrong. I always understood it more as optimizing for a specific benchmark / use case to the point that the model starts failing in other areas.
But the way you phrase it, it’s just “the model is not properly able to generalize”, ie it doesn’t understand the concept of silence, which also makes sense.
But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting”? Where do you draw the line?
I don't think so. Overfitting = the model was too closely aligned to the training data and can't generalize towards *unseen* data. I think it saw "silence" before, so it's not overfitting but just garbage in, garbage out.
> [By] that definition any incorrect answer can be explained by “overfitting to training data”.
No, it can't; some errors are caused by underfitting, for instance. The data could also be correct, but your hyperparameters (such as the learning rate or dropout rate) could still cause your model to overfit.
> Where do you draw the line between “overfitting to training data” and “incorrect data”?
There's no need to draw a line between two explanations that aren't mutually exclusive. They can (as in this case) both be true. Overfitting is the symptom; dirty data is the cause.
Silence is never put in the subtitles of a film, since it isn't necessary. Viewers can tell that nothing is being said if there are actors on the screen. And in situations where there are no actors, there will be a subtitle to indicate what is going on, like "[rock music plays]".
Subtitle authors use this silence to fit in meta information and have done so since the closed captions era.
Proper data cleaning would be to strip this metadata from any subtitle sources. Since this wasn't done, this is fundamentally a classification issue. It may also be an overfitting issue, but that is secondary to the classification problem.
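A minimal sketch of the kind of cleaning step being described, assuming subtitle cues where non-speech annotations follow the common bracketed/parenthesised conventions (the regexes and function name here are illustrative, not taken from any particular pipeline):

```python
import re

# Common non-speech annotation styles in subtitle files: "[silence]",
# "(rock music plays)", "♪ ... ♪" lyric markers, and uploader/credit lines.
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)|♪[^♪]*♪")
CREDIT_LINE = re.compile(r"subtitles?\s+by|downloaded\s+from|www\.", re.IGNORECASE)

def clean_subtitle_text(text: str) -> str | None:
    """Strip non-speech metadata; return None if nothing spoken remains."""
    if CREDIT_LINE.search(text):
        return None  # drop credit/advertising lines entirely
    cleaned = BRACKETED.sub("", text).strip()
    return cleaned or None  # an empty result means the cue was pure metadata

# Example: metadata-only cues are dropped rather than kept with empty text,
# so silent audio never ends up paired with strings like "[silence]".
cues = ["[silence]", "Hello there.", "(door creaks)", "Subtitles by XYZ"]
print([clean_subtitle_text(c) for c in cues])
# -> [None, 'Hello there.', None, None]
```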
I think it's a data quality problem first, which might lead to a sort of overfitting as a consequence.
How would the AI know that a series of zero-amplitude audio samples should generate the string "[silence]"?
It can only know that if the vast majority of silent audio segments in the training set are consistently labeled with that string. But that doesn't seem to be the case: silence is either not labeled at all, or labeled with all kinds of different markers, or labeled with unrelated things like copyright credits.
So even if the model successfully learns a generalized representation of the concept of "silence", it's not clear at all which of all the different labels it should use for that concept.
So what might happen is that the model then starts to overfit on the tiny variations of the individual silence segments, in a desperate attempt to devise some kind of system behind all the different "silence" labels - which will of course go wrong spectacularly, as such a system doesn't exist. (Or if it does, it's entirely accidental and not something that should be learned.)
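As a rough illustration, one could audit a corpus for how near-silent segments are actually labelled; a scattered label distribution means there is no single target for the model to generalise to. A hypothetical sketch, assuming (audio samples, transcript) pairs and an arbitrary RMS threshold for "silence":

```python
from collections import Counter

import numpy as np

def is_near_silent(samples: np.ndarray, threshold: float = 1e-3) -> bool:
    """Treat a segment as 'silent' if its RMS amplitude is below a small threshold."""
    return float(np.sqrt(np.mean(samples ** 2))) < threshold

def silence_label_distribution(segments) -> Counter:
    """Count which transcript strings are attached to near-silent audio.

    `segments` is assumed to be an iterable of (audio_samples, transcript) pairs.
    A highly scattered result ("[silence]", "...", credits, lyric markers) means
    there is no consistent target for the model to learn for silence.
    """
    counts = Counter()
    for samples, transcript in segments:
        if is_near_silent(samples):
            counts[transcript.strip() or "<empty>"] += 1
    return counts

# Toy example:
rng = np.random.default_rng(0)
toy = [
    (np.zeros(16000), "[silence]"),
    (np.zeros(16000), "Subtitles by XYZ"),
    (rng.normal(0, 0.2, 16000), "Hello there."),
]
print(silence_label_distribution(toy).most_common())
# -> [('[silence]', 1), ('Subtitles by XYZ', 1)]
```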
It's actually because it is incapable of recognising when it does not know the answer. It will give you the nearest match, even if that is completely incorrect.
I use JetBrains AI Assistant for its great integration with the editor and the codebase, and have been experimenting with Claude Code too. JetBrains Assistant still has better editor integration for things like reviewing generated diffs and generating code based on the currently selected code.
My augmented workflow is to “chat” with GPT, because it’s free and powerful, to refine my ideas and surface things I hadn’t thought about. Then I start writing code to get into the flow of things. I’ve found that if I use the LLM straight away, I disengage, become lazy, and lose context and understanding of the code. In these situations, I’ve had to redo the code more often than not. Lack of understanding is one part of why, but more importantly, disengaged prompting leads to vague and incorrect outcomes.
When I’m very clear in my head about my goal, I create a prompt either directly from my cursor, or if the changes are larger, I ask the LLM to not apply the changes but instead show them to me. I do both these things within the IDE in the chat window. I review the code and sometimes I’m happy applying it as is, other times I copy and paste it and tweak it manually.
I’ve got barebones rules set up; I haven’t felt the need to go overboard. JetBrains Assistant does a good job of passing relevant context to the model.
I keep my prompts top-down (https://en.wikipedia.org/wiki/BLUF_(communication)) and explicit. Sometimes I’m much more detailed than others, and I’ve found that extra detail isn’t always necessary for a good result.
I'm actually building something for JetBrains called https://sweep.dev. We're trying to bring next-edit prediction (like in Cursor) to JetBrains IDEs.
Until an LLM or some new form of AI can manage an entire architecture itself, I say we need a middle ground. Something will always go wrong, and understanding of the system is necessary to troubleshoot.
I think the urge to keep up is simply fear of being left behind. Fear is also what drives people to become defensive when others dismiss the idea of needing to keep up, because that undermines their core belief that the new “AI skills” they’ve acquired will keep them safe from job disruption.
It’s also interesting that at the heart of the skill set evolving around efficient LLM use is communication. Engineers and technical people have been the first to admit for decades that they struggle with effective communication. Now everyone’s an expert orator, capable of describing to the LLM in phenomenal detail what they need from it?
Not worrying about all this feels so much better!
This week I saw a video of a robot assembling PCBs at lightning speed. It reminded me a lot of an LLM coding for us, but there are still numerous people in the production line to design, oversee and manage that kind of production. Software engineering is changing, but it's not going away.
LinkedIn is full of BS posturing, ignore it.