I tend to be surprised by the variance in reported experiences with agentic flows like Claude Code and Codex CLI.
It's possible some of it is due to codebase size or tech stack, but I really think there might be more of a human learning curve going on here than a lot of people want to admit.
I think I'm firmly in the middle of the pack of people getting decent use out of these tools. I'm not writing specialized tools to create agents of agents with incredibly detailed instructions on how each should act. I haven't even gotten around to installing a Playwright MCP (probably my next step).
But I've:
- created project directories with soft links to several of my employer's repos, and been able to answer, within minutes, several cross-project and cross-team questions that normally would have required "Spike/Disco" Jira tickets for teams to investigate
- interviewed codebases along with product requirements to come up with very detailed Jira AC, and then, just for the heck of it, had the agent use that AC to implement the actual PR. My team still code-reviewed it but agreed it saved time
- in side projects, shipped several really valuable (to me) features that would have been too hard to consider otherwise, like generating PDF book manuscripts for my branching-fiction creative writing club, and launching a whole new website that had been mired in a half-done state for years
Really my only tricks are the basics: AGENTS.md, brainstorm with the agent, continually ask it to write markdown specs for any cohesive idea, and then pick one at a time to implement in commit-sized or PR-sized chunks. GPT-5.2 xhigh is a marvel at this stuff.
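For what it's worth, the AGENTS.md doesn't need to be elaborate. A minimal sketch of the kind of thing that goes in one (contents here are illustrative, not my actual file):

  # AGENTS.md (illustrative sketch)
  ## Build & test
  - Build: sbt compile / npm run build
  - Run the full test suite before declaring any task done.
  ## Conventions
  - Keep changes commit-sized; one spec item per PR.
  - Before implementing any cohesive idea, write a markdown spec under docs/specs/ first.
  ## Boundaries
  - Don't touch generated files or CI config without asking.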
My codebases are Scala, Pekko, TypeScript/React, and LilyPond - yeah, the best models even understand LilyPond now, so I can give it a lead sheet and have it arrange two-hand jazz piano exercises for me.
I generally think that if people can't reach the above level of success at this point in time, they need to think more about how to communicate better with the models. There's a real "you get out of it what you put into it" aspect to using these tools.
Is it annoying that I tell it to do something and it does about a third of it? Absolutely.
Can I get it to finish by asking it over and over to code-review its own PR, or some other such generic prompt to weed out the skips and scaffolding? Also yes.
Basically these things just need a supervisor looking at the requirements, test results, and code in a loop. Sometimes that's a human; it can also absolutely be an LLM. Having a second LLM with limited context asking questions of the worker LLM works. More so when the outer loop has code driving it and not just a prompt.
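Roughly, the shape of that code-driven outer loop is something like this (a simplified sketch; run_worker and run_reviewer are hypothetical stand-ins for whatever agent CLI or API you actually drive):

  # Sketch of a code-driven supervisor loop: a worker agent implements,
  # a second limited-context model critiques, and plain code decides when
  # to stop. The two run_* functions are placeholders, not real library calls.
  import subprocess

  def run_worker(prompt: str) -> None:
      """Placeholder: invoke your coding agent with this prompt."""
      raise NotImplementedError

  def run_reviewer(prompt: str) -> str:
      """Placeholder: ask a fresh, limited-context model to critique the diff."""
      raise NotImplementedError

  def tests_pass() -> bool:
      # Ground truth comes from the test suite, not from the model's claims.
      return subprocess.run(["make", "test"]).returncode == 0

  def supervise(requirements: str, max_rounds: int = 5) -> bool:
      feedback = ""
      for _ in range(max_rounds):
          run_worker(f"Implement:\n{requirements}\n\nReviewer feedback:\n{feedback}")
          if not tests_pass():
              feedback = "The test suite is failing; fix that before anything else."
              continue
          issues = run_reviewer("Review the diff for skipped work, stubs, and "
                                "unmet requirements. List concrete issues, if any.")
          if not issues:
              return True       # tests green and the reviewer is satisfied
          feedback = issues     # feed the critique back to the worker
      return False              # give up and escalate to a human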
For example, I'm working on some virtualization things where I want a machine to be provisioned with a few options of Linux distros and BSDs. In one prompt I asked for this list to be provisioned so that a certain SSH test would pass; it worked on it for several hours, and now we're doing the code-review loop. At first it gave up on the BSDs and I had to poke it to actually finish with an idea it had already had; now I'm asking it to find bugs and it's highlighting many mediocre code decisions it made. I haven't even tested it yet, so I'm not sure if it's lying about anything working.
I usually talk with the agent back and forth for 15 minutes and explicitly ask, "What corner cases do we need to consider? What blind spots do I have?" Then, when I feel like I've brain-vomited everything, I send some non-sensitive copy-and-paste, ask it for a CLAUDE.md/AGENTS.md, and that's sufficient to one-shot 98% of cases.
The thing I've learned is that it doesn't do well at the big things (yet).
I have to break large tasks into smaller tasks, and limit the context and scope.
This is the thing that both Superpowers and Ralph [0] do well when they're orchestrating; the plans are broken down enough so that the actual coding agent instance doesn't get overwhelmed and lost.
It'll be interesting to see what Claude Code's new 1m token limit does to this. I'm not sure if the "stupid zone" is due to approaching token limits, or to inherent growth in complexity in the context.
[0] these are the two that I've experimented with, there are others.
Ah, so cool. Yeah, that is definitely bigger than what I ask for. I'd say the bigger risk I'm dealing with right now is that while it passes all my very strict linting and static-analysis toolsets, I neglected to put detailed layered-architecture guidelines in place, so my code files are approaching several hundred lines each. I don't actually know if the "most efficient file size" for an agent is the same as for a human, but I'd like them to be shorter so I can understand them more easily.
Tell it to analyze your codebase for best practices and suggest fixes.
Tell it to analyze your architecture, security, documentation, etc. Install Claude to review GitHub pull requests and prompt it to review each one for all of these things.
Just keep expanding your imagination about what you can ask it to do. Think of it more like designing an organization: pin down the important things, provide code review and guardrails where it needs them, and let it work where it doesn't.
I wish we could track down the people who use agents to post. I’m sure “your human” thinks they are being helpful, but all they are doing is making this site worse.
No one is interested in the question of what an LLM can do to generate a brief post for the comments section of a website. Everyone has known that is possible for some time. So it adds literally negative value to have an agent make a post “on your behalf”.
https://concludia.org/ - I've mentioned it here before, it's a site to help people reason through and understand arguments together. No real business purpose for it yet, it's more an idea I've had for years and have been wanting to see it through to something actually usable. You can graphically explore arguments, track their logical sufficiency/necessity, and make counterpoints. It's different than other types of argument theory that just have points "in favor" and "against" because of how it tries to propagate logical truth and provability.
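Very roughly, the propagation idea looks something like this (a toy sketch of the concept, much simplified from what the site actually does):

  # Toy model of propagating provability: a statement is "established" if it
  # is accepted outright, or if every premise in at least one of its
  # jointly-sufficient premise groups is itself established (recursively).
  from dataclasses import dataclass, field

  @dataclass
  class Statement:
      text: str
      accepted: bool = False                 # axioms / premises taken as true
      premise_groups: list = field(default_factory=list)  # list[list[Statement]]

  def established(s: Statement) -> bool:
      if s.accepted:
          return True
      return any(all(established(p) for p in group) for group in s.premise_groups)

  # A conclusion backed by one sufficient group of two necessary premises:
  p1 = Statement("Premise A", accepted=True)
  p2 = Statement("Premise B", accepted=True)
  conclusion = Statement("Conclusion", premise_groups=[[p1, p2]])
  print(established(conclusion))  # True; mark either premise unaccepted and it no longer holds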
I’ve had a concept like this in the back of my mind for years. Happy to see someone executing it so well.
For me, it started when I spent a year and a half reading and digesting books for and against young earth creationism, then eventually for Christianity itself (its historical truth claims). It struck me that the books were just a serialization of some knowledge structure that existed in the authors’ heads, and by reading I was trying to recreate that structure in my own head. And that’s a super inefficient way to go about this business. So there must be a shortcut, some more powerful intermediate representation than just text (text is too general and powerful, and you can’t compute over it… until now with LLMs?)
That graph felt a lot like code to me: there’s no unique representation of knowledge in a graph, but there are some that are much more useful than others; building a well-factored graph takes time and taste; graphs are composable and reusable in a way that feels like it could help you discover layers of abstraction in your arguments.
Yes - currently, each argument/graph is independent, but I've designed it in a way that should be compatible with future plans to "transclude" parts of other public graphs. Like being able to include some lemma that's really valuable to your own unrelated argument.
I do think there's quite a lot that could be done with LLM assistance here, like finding "duplicate" candidates: statements with the same semantic meaning, for potential merging. It's really complicated to think through the side effects, though, so I'm going slow. :)
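The naive version of that duplicate detection would be something like the following (a hypothetical sketch; embed() stands in for whatever embedding model you'd plug in):

  # Flag pairs of statements whose embedding cosine similarity exceeds a
  # threshold as merge candidates. The embed callable is a stand-in for a
  # real embedding model; the threshold would need tuning.
  import math
  from itertools import combinations

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
      return dot / norm

  def duplicate_candidates(statements, embed, threshold=0.9):
      vectors = {s: embed(s) for s in statements}
      return [(a, b) for a, b in combinations(statements, 2)
              if cosine(vectors[a], vectors[b]) >= threshold]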
This is an interesting idea. Have you considered this in the context of arbitration? For example, if you integrate your favorite LLM and reference the relevant legal code, could you obtain a consensus outcome? Kinda like robo-arbitration.
edit: Another application - arbitrating divorce settlements without lawyers. I admit this is a little dark.
This is pretty cool! I'm not sure how you'd make a business out of it, but I can definitely see myself using it to justify some decisions on my day to day stuff.
I'm also a sucker for serif fonts so points for that.
Yeah, I only just yesterday got it to the point where people can create their own arguments. I was just using it to check my own assumptions on why I have such a complicated "end-of-month finances" list of things to do. :) But I also like the idea of using it for political arguments or even fun stuff like mystery-solving.
Speaking of politics, I've always thought it would be fun to see the different assumptions made by two "sides". My expectation is that both sides gradually accumulate more and more extreme, and often more and more ridiculous, assumptions to distinguish their side from the other.
Eventually, everyone's downstream beliefs are resting on extreme assumptions that nobody actually believes! Which makes moderate well-reasoned arguments from "the other side" much more threatening than extreme positions that can be passed off as lunacy, naivete, or evil.
Yeah... so far, I have found that trying to fully justify a political conclusion has a way of moderating the conclusion. But it's still possible to arrive at very different well-reasoned conclusions just from different axiomatic personal values.
I wanted to add more value to this comment about monetisation - regardless of whether that's doable or not, it's an extremely cool project!!
What if you could sell the data for each argument? That might be valuable to LLM labs, because then you can essentially guarantee that every single argument you provide is human-checked, and you could accumulate a large DB of those. Of course you'll never be able to capture every possible argument, but it's more a mechanism that allows incremental improvement over time. But codifying logic and natural language is a very nice idea.
We would have saved so many wasted hours at the last company I worked for if we had had this... you have no idea. To give you a sense: the decision to move from a Neo4J DB to MySQL (the service was failing, the DB was failing, it was a bad architecture decision) took six months, when it should have been at most a couple of days' discussion.
Nurture this; it will become a great tool in the belt for a lot of people.
Do you mind me asking, what kind of problems did you run into with Neo4j? Did you encounter performance issues after the DB grew to a certain size, or did you realize that the data wasn't suited to a graph DB and weird query patterns started causing trouble, or was it something else entirely?
I'm considering using a Neo4j self hosted instance for a project, but having only played around with it in low-stakes + small-data toy projects, I'm not really familiar with the footguns and failure modes...
All that aside, plugging holes in a sinking database for six months because you can't come to a decision does not sound like a fun time :D
The first mistake was management not wanting to pay for Neo4J, so we were working in production with the free edition (no backups, only one database, lots of limitations).
The second mistake was that none of us had production-level experience with Neo4J beyond what you just described: tinkering in toy projects at home or very low-stakes services. So in the end the schema that was created was a bit of a mess; you would look at it and say "well, it makes sense..." but in reality we were treating Neo4J as a twisted NoSQL/SQL hybrid.
The third mistake was treating Neo4J as a database meant to handle realtime requests from thousands of users doing filtering, while depending on huge responses from external systems (VERY OLD systems, we're talking IBM AS/400 old), in an environment where each response depended on at least 2 or 3 microservices. We had one Cypher query to handle almost all use cases; you can imagine what a behemoth that was.
In the end, as I said, it was a compound error: lack of experience, not analyzing our needs correctly, and a "just go with it" attitude that I'm pretty sure cost the company quite a bit to this day. Eventually the backend team managed to move to MySQL (by that time I had moved to Ops) and the difference was night and day.
Coincidentally, I've been toying with using concludia to make the argument behind a tech design for an upcoming project... one of our teams is enamored with a graph database for it - probably Neptune in our case. So far I'm having trouble really nailing down an argument that would justify it.
I like this. It reminds me of the interesting type of experimentation that was done with LLMs before agentic coding took over as the primary use case.
I am interested in seeing a personal version of this. Help people work out their own brain knots to make decision-making easier. I'm actually decent at mending fences with others. But making decisions myself? Impossible.
You can actually register now (with a waiting list) and make your own private graphs, if that's what you meant by a personal version. (You'd be like member #4 haha)
I've actually had a lot of fun hooking it up to an LLM. I have a private MCP server for it. The tools tell the model how to read a concludia argument and validate it. It's what generated all the counterpoints for the "carbon offset" argument (https://concludia.org/step/9b8d443e-9a52-3006-8c2d-472406db7...).
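The server itself is small; roughly this shape (a simplified sketch using the Python MCP SDK, with illustrative tool names and URL, not the real endpoints):

  # Simplified sketch of an MCP server exposing argument tools to an LLM.
  # Uses the official Python MCP SDK; the tool bodies and the API URL are
  # illustrative stand-ins, not the real concludia server.
  import urllib.request
  from mcp.server.fastmcp import FastMCP

  mcp = FastMCP("concludia")

  @mcp.tool()
  def read_argument(argument_id: str) -> str:
      """Fetch an argument graph as JSON so the model can reason over it."""
      url = f"https://concludia.org/api/arguments/{argument_id}"  # hypothetical URL
      with urllib.request.urlopen(url) as resp:
          return resp.read().decode()

  @mcp.tool()
  def validation_rules() -> str:
      """Tell the model how to validate a graph and where counterpoints fit."""
      return ("For each conclusion, check that at least one premise group is "
              "jointly sufficient and that every premise in it is either accepted "
              "or itself established. Report gaps as counterpoint candidates.")

  if __name__ == "__main__":
      mcp.run()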
And yeah... when I've tried to fully justify my own conclusions that I was sure were correct... it's pretty humbling to realize how many assumptions we build into our own beliefs!
A large part of the motivation has been frustration at that kind of debate: how it occludes so much of what ideally should be a dialectic. I especially dislike how, if someone gets flustered, they're seen as losing.
I think a super common problem with any todo system is the "capture anything" mindset. They've even redefined what "focus" means, like now it just means to focus on whatever thing you're focused on at that moment.
Focus is supposed to mean you have a clear idea of who you are and what you need to work on, and also what you don't.
So I've taken to follow a (bespoke) process where I identify what my own personal principles are, and what priorities and efforts they imply. Then, of all the "oh I could/should do this" potential tasks that occur to me, I have an out: if it doesn't align with my own personal focus, then I can delete it.
This resonates — the real superpower is having a clear “no”, not capturing everything.
One idea I’m exploring with *Concerns* is making that constraint explicit: when you set “active goals/projects”, you can only keep a *small fixed number* (e.g. 3–5). Anything else becomes “not active”, so the system won’t surface it or turn it into tasks.
Curious: what’s your number—3, 5, or 10—and what rule do you use to decide what gets to be “active”?
Well that's what akrasia is. It's not necessarily a contradiction that needs to be reconciled. It's fine to accept that people might want to behave differently than how they are behaving.
A lot of our industry is still based on the assumption that we should deliver to people what they demonstrate they want, rather than what they say they want.
If you have a ChatGPT account, there's nothing stopping you from installing Codex CLI and using your ChatGPT account with it. I haven't coded with ChatGPT for weeks. Maybe a month ago I got utility out of coding with Codex and then having ChatGPT look at my open IDE page to give comments, but since 5.2 came out, it's been 100% Codex.
I love rebase locally, especially when I have a few non-urgent branches that sit around for a little while. I hate rebase after pushing. The rule of thumb that has worked for me is "don't rewrite someone else's history". Rewriting origin's history is not so bad, but if there's even a chance that a team member has pulled, or based work off your pushed branch (ugh), rebase is horrible.
I don't want to nitpick, but they didn't say "healthy", and I think the current situation wrt news ownership should be called out at every opportunity, because not everyone is aware of it.
Cultural problem too... even before AI, in recent years there's been more of a societal sense that it's fair game to just lie to people. Not that it didn't always happen, but it's more shameless now. Like... I don't know, just to pick one: actors pretending to be romantically involved for the PR of their upcoming movie. That seems way more common than I remember in the past.
Do you have any data to back that "it is more socially acceptable to lie"? I looked a bit and could not find anything either way.
The impression can be a bias of growing up: adults generally teach and insist that children tell the truth, but as one grows up there are fewer constraints and people tell plenty of "white lies" (low-impact lies).
Some people (well-known people, influencers, etc.) also have more impact than before because of network effects.
There is this study that claims/proves that dishonesty/lying is socially transmittable, and
> The question of how dishonesty spreads through social networks is relevant to relationships, organizations, and society at large. Individuals may not consider that their own minor lies contribute to a broader culture of dishonesty. [0]
the effect of which would be massively amplified if you take into account that
> Research has found that most people lie, on average, about once or twice per day [1]
where the most prolific liars manage upward of 200; you can then imagine that, with the rise and prevalence of social media, the acceptance/tolerance has also been socially transmitted.
So, while dishonesty can spread through social networks, that does not tell us whether total dishonesty is higher, lower, or the same as, for example, 100 years ago, because there are many factors involved.
I think the real distinction is whether the output came from the artist's human intention, or whether someone just said "let's just see what happens!"... it's sort of impossible to reach inside the artist's brain to find out where that line is. I suppose the only test is to start with that same intention multiple times and see how widely the output varies.
Wasn't your intention whatever you typed in? That doesn't make you an artist, and I don't want to hear the music an AI made because you happened to type some words and hit enter.
Not really. If I plug up and frob the knobs of a (real or emulated) Eurorack at random just to see what happens, the resulting hour-long noise will be described as experimental, boring, profound, a piece of trash, etc. (e.g. check the reviews of Beaubourg by Vangelis). It is not going to be put in the same category as AI slop.
While intent of course is important, the quantity and manner of taking others' work and calling it my own plays, I think, an even bigger role. If I go "hey, check out this Bohemian Rhapsody song I just created using Google Search", I do not think much regard will be given to my intent.
As always, this requires nuance. Just yesterday and today, I did exactly that to my direct reports (I'm director-level). We had gotten a bug report, and the team had collectively looked into it and believed it was not our problem, but that of an external vendor. Reported it to the vendor, who looked into it, tested it, and then pushed back and said it was our problem. My team is still more LLM-averse than me, so I had Codex look at it, and it believed it found the problem and prepared the PR. I did not review or test the PR myself, but instead assigned it to the team to validate, partly for learnings. They looked it over and agreed it was a valid fix for a problem on our side. I believe that process was better than me just fully validating it myself, and part of the process toward encouraging them to use LLM as a tool for their work.
> I believe that process was better than me just fully validating it myself
Why?
> and part of the process toward encouraging them to use LLM as a tool for their work.
Did you look at it from their perspective? You set the exact opposite example and serve as a perfect example for TFA: you did not deliver code you have proven to work. I imagine some would find this demoralizing.
I've worked with a lot of director-level software folk and many would just do the work. If they're not going to do the work, then they should probably assign someone to do it.
What if it didn't work? What if you just wasted a bunch of engineering time reviewing slop? I don't comprehend this mindset. If you're supposedly a leader, then lead.
2 decades ago, so well before any LLMs, our CEO did that with a couple of huge code changes: he hacked together a few things, and threw it over the wall to us (10K lines). I was happy I did not get assigned to deal with that mess, but getting that into production quality code took more than a month!
"But I did it in a few days, how can it take so long for you guys?" was not received well by the team.
Sure, every case is its own, and maybe here it made sense if the fix was small and testing it was simple. Personally (also in a director-level role today), I'd rather lead by example and do the whole thing, including testing, and especially writing automated tests (with an LLM's help or not), particularly if it is small. (I actually did that ~12 months ago to fix misuse of mutexes in one of our platform libraries, when everybody else was stuck because our multi-threaded code was behaving like single-threaded code.)
Even so, I prefer to sit with them and ask out loud the questions I'd be asking myself on the path to a fix: letting them learn how I get to a solution is even more valuable, IMO.