
Hasn’t ChatGPT been supporting skills with a different name for several months now through “agent”?

Back then they gave it folders with instructions and executable files, IIRC.


Not quite the same thing. Implementing skills specifically means that you have code which, on session start, scans the skills/*/skill.md files, reads in their description: metadata, and loads that into the system prompt, along with an instruction that says "if the user asks about any of these particular things, go and read that skill.md file for further instructions".

Here's the prompt within Codex CLI that does that: https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd...

I extracted that into a Gist to make it easier to read: https://gist.github.com/simonw/25f2c3a9e350274bc2b76a79bc8ae...
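For anyone curious, here's a minimal Python sketch of that pattern (the paths, field names, and preamble wording are illustrative, not Codex's actual implementation):

    import re
    from pathlib import Path

    def build_skills_preamble(skills_dir: str = "skills") -> str:
        """Scan skills/*/skill.md, pull out each `description:` field, and
        build a system-prompt section pointing the model at the full files."""
        entries = []
        for skill_file in Path(skills_dir).glob("*/skill.md"):
            text = skill_file.read_text()
            match = re.search(r"^description:\s*(.+)$", text, re.MULTILINE)
            if match:
                entries.append(f"- {skill_file}: {match.group(1).strip()}")
        if not entries:
            return ""
        return (
            "You have access to the following skills:\n"
            + "\n".join(entries)
            + "\nIf the user asks about any of these topics, read that "
              "skill.md file for further instructions before responding."
        )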


I remember you did some reverse engineering when they released agent. Does it not feel quite similar to you?

I know they didn’t dynamically scan for new skill folders but they did have mentions of the existing folders (slides, docs, …) in the system prompt


The main similarity is that both of them take full advantage of the bash tool + file system combination.

You could dual-brand as vibe-npm: only install packages that are in your model's training dataset.


> Are there specific benchmarks that compare models vs themselves with and without scratchpads? High with:without ratios being reasonier models?

Yes, simplest example: https://www.anthropic.com/engineering/claude-think-tool
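The core of the think tool is just a no-op tool whose description tells the model to use it as a scratchpad. A rough Python sketch (the wording and schema here are illustrative, not Anthropic's exact definition):

    # A "think" tool: the model calls it to jot down reasoning, and the app
    # acknowledges the call without executing anything.
    THINK_TOOL = {
        "name": "think",
        "description": (
            "Use this tool to think about something. It will not obtain new "
            "information or change anything; it just records your thought so "
            "you can reason through complex steps before acting."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "thought": {"type": "string", "description": "Your reasoning."}
            },
            "required": ["thought"],
        },
    }

    def handle_tool_call(name: str, tool_input: dict) -> str:
        if name == "think":
            # Nothing to execute: the benefit comes from the model writing
            # the thought into its own context window.
            return "ok"
        raise ValueError(f"unknown tool: {name}")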


Being skeptical of all the numbers I see, it still seems Instagram is on roughly even footing with TikTok for upcoming generations.

I don’t doubt they may destroy their own product (like Google Search), but I do think it’s going to take a long, long time.


And now Threads, which apparently is quietly growing.


There is much missing from this prompt; tool call descriptors are the most obvious. See for yourself using even a year-old jailbreak [1]. There are some great ideas in how they’ve set up other pieces, such as Cursor rules.

[1]: https://gist.github.com/lucasmrdt/4215e483257e1d81e44842eddb...


They use different prompts depending on the action you're taking. We provided just a sample because our ultimate goal here is to start A/B testing models, optimizing prompts + models, etc. We provide the code to reproduce our work so you can see other prompts!

The Gist you shared is a good resource too though!


Maybe there is some optimization logic that only appends tool details that are required for the user’s query?

I’m sure they are trying to slash tokens where they can, and removing potentially irrelevant tool descriptors seems like low-hanging fruit to reduce token consumption.


I definitely see different prompts based on what I'm doing in the app. As we mentioned, there are different prompts depending on whether you're asking questions, doing Cmd-K edits, working in the shell, etc. I'd also imagine that they customize the prompt by model (unobserved here, but we can also customize per-model using TensorZero and A/B test).


Yes, this is one of the techniques apps can use. You vectorize the tool descriptions and then do a lookup based on the user's query to select the most relevant tools; this is called pre-computed semantic profiles. You can even hash the queries themselves, cache the tools that were used, and then do similarity lookups by query.
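A rough sketch of the description-embedding half (the tools are made up and `embed` is a stand-in for whatever embedding model you'd use):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Stand-in for a real embedding model (e.g. an embeddings API call)."""
        raise NotImplementedError

    # Hypothetical tool descriptions; in practice these come from your tool schema.
    TOOL_DESCRIPTIONS = {
        "run_shell": "Execute a shell command in the workspace.",
        "edit_file": "Apply an edit to a file in the repository.",
        "web_search": "Search the web for up-to-date information.",
    }

    def precompute_tool_vectors() -> dict[str, np.ndarray]:
        # Done once at startup: one embedding per tool description.
        return {name: embed(desc) for name, desc in TOOL_DESCRIPTIONS.items()}

    def select_tools(query: str, tool_vectors: dict[str, np.ndarray],
                     top_k: int = 2) -> list[str]:
        """Return the tools whose descriptions are most similar to the query."""
        q = embed(query)
        scores = {
            name: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for name, v in tool_vectors.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:top_k]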


cool stuff



Ahh, there's a bug with the z-index on the Turing one (I made it a "legendary card" for surviving 70 years), will fix shortly.

Here's the link for the moment: https://arxiv.org/pdf/2405.08007

Also, if you want to read the original Turing paper (it was interesting to look back upon; I think the future of benchmarks may look a lot more like the Turing test): https://courses.cs.umbc.edu/471/papers/turing.pdf


Btw, the conversation with the bot is super robot-like, because in real life people send two, three, even five separate messages in a row and mix topics and different conversations. It still feels very robot-like to me.


Thank you for the links!


For my year-end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:

2023: GPT-4 was truly something new

- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide

2024: Others caught up, with progress in fits and spurts

- o1/o3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5 / 4o incremented some benchmarks into saturation, and pushed new visual evals into saturation
- Llama 3 / Qwen 2.5 brought open-weight models to be competitive across the board

And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.

Data & sources (if you'd like to contribute): https://github.com/R0bk/killedbyllm

Interactive timeline: https://r0bk.github.io/killedbyllm/

P.S. I've had a hard time deciding what benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions then please let me know.


Appreciate the kind words. Honestly, Karpathy’s YT series is one of the best kickoff series I've ever seen. He has a certain ability to simplify complex problems and ideas that feels a bit Feynmanesque.

And yes, please do, and if you have any feedback I'd love to hear it! Half the motivation for this tool is trying to find a better way to build intuition for how these complex models actually function. I believe the best way to do that is by reducing iteration times as much as possible and by bringing models into worlds we understand: spatially laying their components out, letting us toy with them, seeing what the impacts are, and playing more. At the end of the day these models are so high-dimensional that it's just not possible to dig in and understand them from the ground floor upwards; we need better ways to build intuition.


There's a task on my list to write a full tutorial using it to replicate some recent interpretability research (finding induction heads is up first). But even without a full tutorial, I've been surprised how quickly people have been able to pick up and understand it just by selecting a model and playing around.

If you are interested, there is this brilliant tutorial [1] by Callum McDougall for the TransformerLens library. Going through its steps but completing them in Transpector would be a great way to learn it and build intuition about transformers and where research is today.

On the model side, I've added a supported model list [2] and a GIF of how to switch between models [3]; I appreciate the feedback on what information is most useful for the readme. Also, in case your question was about API-access-only models (GPT-4, Bard, ...): unfortunately Transpector requires access to the model weights and activations, so it's currently not possible to use it with those.

[1]: https://colab.research.google.com/drive/1LpDxWwL2Fx0xq3lLgDQ...

[2]: https://github.com/R0bk/Transpector/blob/main/docs/supported...

[3]: https://github.com/R0bk/Transpector/blob/main/README.md
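If it helps, the tutorial's first steps boil down to roughly this (assuming GPT-2 small as the example model):

    from transformer_lens import HookedTransformer

    # Load a small model with hooks on every intermediate activation.
    model = HookedTransformer.from_pretrained("gpt2")

    tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
    logits, cache = model.run_with_cache(tokens)

    # Attention pattern of layer 0, head 0: a (seq, seq) matrix you can
    # inspect here or view spatially in Transpector.
    pattern = cache["pattern", 0][0, 0]
    print(pattern.shape)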


Postico 2 includes this; I've been using it for a few months.

