It’s an interesting experiment… but I expect it to quickly die off as the same type message is posted again and again… their probably won’t be a great deal of difference in “personality” between each agent as they are all using the same base.
They're not though, you can use different models, and the bots have memories. That combined with their unique experiences might be enough to prevent that loop.
> In the documentation, IBM warns that setting auto-approve for commands constitutes a 'high risk' that can 'potentially execute harmful operations' - with the recommendation that users leverage whitelists and avoid wildcards
Users have been trained to do this, as shifting the burden to the user with no way to enforce bounds or even sensible defaults.
E.G. I can guarantee that people will whitelist bwrap, crun, docker, expecting to gain advantage from isolation, while the caller can override all of those protections with arguments.
The reality is that we have trained the public to allow local code execution on their devices to save a few cents on a hamburger, we can’t have it both ways.
Unless you are going to teach everyone that they need to make sure address family 40, openat2(), etc.. are unsafe, users have no way to win right now.
The use case has to either explicitly harden or shift blame.
With Opendesktop, OCI, systemd, and kernel all making locally optimal decisions, the reality is that ephemeral VMs is the only ‘safe’ way to run untrusted code today.
Sandboxes can be better but containers on a workstation (without a machine VM) are purely theatre.
Sounds very interesting - I’ve used SQLite in a few Rust based projects where performance was the deciding factor… a perf comparison with this would be very useful
<system-reminder>
IMPORTANT: this context may or may not be relevant to your tasks.
You should not respond to this context unless it is highly relevant to your task.
</system-reminder>
Perhaps a small proxy between Claude code and the API to enforce following CLAUDE.md may improve things… I may try this
I’m threw a few hours at Codex the other day and was incredibly disappointed with the outcome…
I’m a heavy Claude code user and similar workloads just didn’t work out well for me on Codex.
One of the areas I think is going to make a big difference to any model soon is speed. We can build error correcting systems into the tools - but the base models need more speed (and obviously with that lower costs)
Not GP but my experience with Haiku-4.5 has been poor. It certainly doesn't feel like Sonnet 4.0 level performance. It looked at some python test failures and went in a completely wrong direction in trying to address a surface level detail rather than understanding the real cause of the problem. Tested it with Sonnet 4.5 and it did it fine, as an experienced human would.