Not quite the same thing. Implementing skills specifically means you have code that, on session start, scans the skills/*/skill.md files, reads in their description: metadata, and loads that into the system prompt, along with an instruction that says "if the user asks about any of these particular things, go and read that skill.md file for further instructions".
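A minimal sketch of what that startup scan could look like, assuming each skill.md carries a "description:" line in its frontmatter (the file layout, field name, and function names here are my assumptions, not anyone's actual implementation):

    import glob
    import re

    def load_skill_descriptions(root="skills"):
        """Scan skills/*/skill.md and pull the description: value from each file."""
        skills = {}
        for path in glob.glob(f"{root}/*/skill.md"):
            with open(path, encoding="utf-8") as f:
                text = f.read()
            match = re.search(r"^description:\s*(.+)$", text, flags=re.MULTILINE)
            if match:
                skills[path] = match.group(1).strip()
        return skills

    def build_system_prompt_snippet(skills):
        """Render the descriptions plus the 'go read the full file' instruction."""
        lines = ["You have access to the following skills:"]
        for path, description in skills.items():
            lines.append(f"- {description} (full instructions: {path})")
        lines.append(
            "If the user's request relates to one of these skills, "
            "read that skill file before answering."
        )
        return "\n".join(lines)

    if __name__ == "__main__":
        print(build_system_prompt_snippet(load_skill_descriptions()))

The point is that only the short descriptions go into the system prompt up front; the full skill file is only read when the request actually matches.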
There is a lot missing from this prompt; tool call descriptors are the most obvious. See for yourself using even a year-old jailbreak [1]. There are some great ideas in how they've set up other pieces such as Cursor rules.
They use different prompts depending on the action you're taking. We provided just a sample because our ultimate goal here is to start A/B testing models, optimizing prompts + models, etc. We provide the code to reproduce our work so you can see other prompts!
The Gist you shared is a good resource too though!
Maybe there is some optimization logic that only appends tool details that are required for the user’s query?
I’m sure they are trying to slash tokens where they can, and removing potentially irrelevant tool descriptors seems like low-hanging fruit to reduce token consumption.
I definitely see different prompts based on what I'm doing in the app. As we mentioned, there are different prompts depending on whether you're asking questions, doing Cmd-K edits, working in the shell, etc. I'd also imagine that they customize the prompt by model (unobserved here, but we can also customize per-model using TensorZero and A/B test).
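To make the routing idea concrete, here is a generic sketch (not Cursor's or TensorZero's actual API; every name and prompt string below is a placeholder I made up): pick a prompt variant by action, prefer a model-specific override when one exists, and randomize between variants to A/B test.

    import random

    # Hypothetical prompt variants keyed by (action, model).
    PROMPTS = {
        ("chat", "default"): ["You are a helpful coding assistant..."],
        ("cmd_k_edit", "default"): ["Rewrite only the selected code..."],
        ("shell", "default"): ["Translate the request into a shell command..."],
        # Two variants for the same slot -> an A/B test.
        ("chat", "gpt-4o"): [
            "You are a concise pair programmer...",
            "You are a senior engineer reviewing code...",
        ],
    }

    def pick_prompt(action: str, model: str, rng: random.Random = random.Random()) -> str:
        """Prefer a model-specific prompt, fall back to the action default,
        and sample uniformly when several variants are being A/B tested."""
        variants = PROMPTS.get((action, model)) or PROMPTS[(action, "default")]
        return rng.choice(variants)

    print(pick_prompt("chat", "gpt-4o"))

In a real system you'd log which variant served each request so the A/B comparison can actually be evaluated.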
Yes, this is one of the techniques apps can use. You vectorize the tool descriptions and then do a lookup based on the user's query to select the most relevant tools; this is called pre-computed semantic profiles. You can even hash the queries themselves, cache the tools that were used, and then do similarity lookups by query.
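A rough sketch of that pattern, under the assumption that you have some embedding model available (the embed() placeholder below is a stand-in so the example runs; the tool names, descriptions, and cache shape are all invented for illustration):

    import hashlib
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder: swap in a real embedding model (API or local).
        # This deterministic hash-seeded vector only exists so the sketch runs end to end.
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
        return np.random.default_rng(seed).standard_normal(384)

    TOOLS = {
        "run_shell": "Execute a shell command in the workspace and return its output.",
        "edit_file": "Apply an edit to a file given a path and a patch.",
        "search_code": "Search the codebase for a symbol or string.",
    }

    # Pre-compute the tool-description embeddings once (the "semantic profiles").
    TOOL_VECS = {name: embed(desc) for name, desc in TOOLS.items()}

    # Optional: cache the tools selected for a query, keyed by the query's hash.
    QUERY_CACHE: dict[str, list[str]] = {}

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def select_tools(query: str, top_k: int = 2) -> list[str]:
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in QUERY_CACHE:  # exact-match cache hit on a previously seen query
            return QUERY_CACHE[key]
        q = embed(query)
        ranked = sorted(TOOL_VECS, key=lambda name: cosine(q, TOOL_VECS[name]), reverse=True)
        QUERY_CACHE[key] = ranked[:top_k]
        return QUERY_CACHE[key]

Only the top-k tool descriptors then get appended to the prompt, which is where the token savings come from.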
Also if you want to read the original Turing paper (it was interesting to look back upon, I think the future of benchmarks may look a lot more like the Turing test): https://courses.cs.umbc.edu/471/papers/turing.pdf
Btw the conversation with the robot feels super robot-like, because in real life people send two, three, five separate messages in a row and mix topics and different conversations. It still feels very robotic to me.
For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:
2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide
2024: Others caught up, progress in fits and spurts
- O1/O3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o incrementally pushed several benchmarks into saturation, and saturated new visual evals
- Llama 3/Qwen 2.5 made open-weight models competitive across the board
And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.
P.S. I've had a hard time deciding what benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions then please let me know.
Appreciate the kind words. Honestly, Karpathy's YT series is one of the best kickoff series I've ever seen. He has a certain ability to simplify complex problems and ideas that feels a bit Feynmanesque.
And yes, please do, and if you have any feedback I'd love to hear it! Half the motivation for this tool is trying to find a better way to build intuition for how these complex models actually function. I believe the best way to do this is by reducing iteration times as much as possible and by bringing models into worlds we understand: spatially laying their components out, letting us toy with them, and seeing what the impacts are. At the end of the day these models are so high dimensional that it's just not possible to dig in and understand them from the ground up; we need better ways to build intuition.
There's a task on my list to write a full tutorial using it to replicate some recent interpretability research (finding induction heads is up first). But even without a full tutorial, I've been surprised how quickly people have been able to pick up and understand it just by selecting a model and playing around.
If you are interested, there is this brilliant tutorial [1] by Callum McDougall for the Transformer Lens library. Going through its steps but completing them in Transpector would be a great way to learn it and build intuition about transformers and where research is today.
On the model side, I've added a supported model list [2] and a gif of how to switch between models [3]. I appreciate the feedback on what information is most useful for the readme. Also, in case your question was about API-access-only models (GPT-4, Bard, ...): unfortunately Transpector requires access to the model weights and activations, so it's currently not possible to use it with those.
Back then they gave it folders with instructions and executable files, iirc.