3. DIRECT LLM PROMPTING
In this method, contestants use a traditional LLM (such as GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This approach performs poorly, scoring <5%. Fine-tuning a state-of-the-art (SOTA) LLM on millions of synthetic ARC-AGI examples raises this to only ~10%.
"LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." - François Chollet
Additionally, keep in mind that Kaggle submissions will not have internet access, so using a third-party, cloud-hosted LLM is not possible.
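To make the approach concrete, below is a minimal sketch of how an ARC-AGI task (stored in the competition's JSON format of `train`/`test` input-output grids) might be serialized into a text prompt. The `grid_to_text` and `build_prompt` helpers are hypothetical, the digit-per-cell encoding is just one of many possible choices, and the model call itself is omitted.

```python
def grid_to_text(grid):
    """Render a grid of ints 0-9 as rows of digits, one row per line."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def build_prompt(task):
    """Serialize an ARC-AGI task dict into a plain-text prompt.

    `task` follows the competition JSON format: {"train": [...], "test": [...]},
    where each element holds an "input" grid and (for train pairs) an "output" grid.
    """
    parts = ["Infer the transformation rule from the examples, then apply it."]
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# Tiny illustrative task: the hidden rule is "mirror each row horizontally".
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4], [0, 0]], "output": [[4, 3], [0, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]]}],
}

print(build_prompt(task))  # This string would be sent to the LLM.
```

The weak scores above suggest that no amount of clever serialization fixes the underlying issue Chollet points to: the model's weights are frozen, so it cannot adapt to a novel rule at inference time.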