
Genuine question: How does this work? How does an LLM do object detection? Or more generally, how does an LLM do anything that is not text? I always thought tasks like this were usually just handed off to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision. It doesn't make sense to me why Gemini 2 and 2.5 would have different vision capabilities; shouldn't they both have access to the same purpose-trained, state-of-the-art vision model?


You tokenize the image and then pass it through a vision encoder that is generally trained separately from large-scale pretraining (using, say, contrastive captioning) and then attached to the model during RLHF. I wouldn't be surprised if the vision encoder is used in pretraining now too; that would be a different objective than next-token prediction, of course (unless they use something like next-token prediction for images, which I don't think is the case).
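To make the contrastive pretraining part concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss over paired image/text embeddings. The temperature value and the assumption that you already have encoder outputs are mine, not anything the labs have published:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb, text_emb, temperature=0.07):
        # image_emb, text_emb: (batch, dim) outputs of the vision and text encoders
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(len(logits), device=logits.device)
        # matching image/text pairs sit on the diagonal: pull them together,
        # push mismatched pairs apart, symmetrically in both directions
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2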

Different models have different encoders; they are not shared, since the datasets (and even the model sizes) vary across models. So performance between models will vary.

What you seem to be thinking is that the text model simply calls an API to a vision model, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture into the language architecture. Robotics research has been doing this for a while.
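Roughly what that single forward pass looks like in open VLMs (LLaVA-style): patch features from the vision encoder are projected into the LLM's embedding space and concatenated with the text token embeddings before the decoder runs. This is only an illustrative sketch; the module names, dimensions, and the embed/inputs_embeds interface are placeholders, not Gemini's actual architecture:

    import torch
    import torch.nn as nn

    class TinyVLM(nn.Module):
        def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.vision_encoder = vision_encoder              # stand-in for a ViT, often pretrained contrastively
            self.projector = nn.Linear(vision_dim, llm_dim)   # maps patch features into the LLM embedding space
            self.language_model = language_model              # stand-in for a decoder-only transformer

        def forward(self, pixel_values, input_ids):
            patch_feats = self.vision_encoder(pixel_values)     # (B, num_patches, vision_dim)
            image_tokens = self.projector(patch_feats)          # (B, num_patches, llm_dim)
            text_tokens = self.language_model.embed(input_ids)  # (B, seq_len, llm_dim), placeholder embed method
            # image "tokens" are just extra embeddings prepended to the text sequence;
            # the decoder attends over both in one forward pass, no tool call involved
            inputs_embeds = torch.cat([image_tokens, text_tokens], dim=1)
            return self.language_model(inputs_embeds=inputs_embeds)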


They might use YouTube; that gives them next-frame prediction plus multimodal grounding via subtitles and audio.

IIUC the native voice-to-voice models were trained on YouTube-sourced audio. Skipping any intermediate text form is really helpful for fuzzy speech, such as people slurring or mumbling words. Having access to a full world model while deciphering speech also helps in very context-heavy situations, for example spoken (kana/phonetic) Japanese, which relies on the listener's understanding of context to parse homophones, and uses non-phonetic Han characters (kanji) in writing to make up for the inability to interject a clarification.


> I always thought tasks like this were usually just handed off to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision.

Most vision LLMs don't actually use a separate vision model. https://huggingface.co/blog/vlms is a decent explanation of what's going on.

Most of the big LLMs these days are vision LLMs - the Claude models, the OpenAI models, Grok and most of the Gemini models all accept images in addition to text. To my knowledge none of them are using tool calling to a separate vision model for this.

Some of the local models can do this too - Mistral Small and Gemma 3 are two examples. You can tell they're not tool calling out to anything because they run directly out of a single model weights file.
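If you want to see that locally, a single request to something like an Ollama server sends the image straight into the one model; a rough sketch (the model name, file path, and having the model pulled locally are all assumptions on my part):

    import base64
    import requests

    # assumes an Ollama server on the default port with a vision-capable model
    # already pulled; "gemma3" and "photo.jpg" are placeholders
    with open("photo.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma3",
        "prompt": "Describe this image.",
        "images": [image_b64],
        "stream": False,
    })
    print(resp.json()["response"])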


Not a contradiction to anything you said, but o3 will sometimes whip up a Python script to analyse the pictures I give it.

For instance, I asked it to compute the symmetry group of a pattern I found on the wallpaper of a Lebanese restaurant this weekend. It realised it was unsure of the symmetries and used a Python script to rotate and mirror the pattern and compare it with the original, checking the symmetries it suspected. Pretty awesome!
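The kind of script it writes is roughly this: apply candidate rotations/reflections and measure how far each is from the original. A rough sketch of the idea, with the filename and tolerance made up:

    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("wallpaper_tile.png").convert("L"), dtype=float)
    s = min(img.shape)
    img = img[:s, :s]  # crop to a square so 90-degree rotations are shape-compatible

    candidates = {
        "rot90":  np.rot90(img, 1),
        "rot180": np.rot90(img, 2),
        "flip_h": np.fliplr(img),
        "flip_v": np.flipud(img),
    }
    for name, transformed in candidates.items():
        # mean absolute pixel difference; a small value suggests the symmetry holds
        diff = np.abs(img - transformed).mean()
        print(f"{name}: mean |diff| = {diff:.1f} -> {'likely' if diff < 5 else 'unlikely'}")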


It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.


If you have 20 minutes, this is a very good video

https://www.youtube.com/watch?v=EzDsrEvdgNQ


tokens are tokens




