
My hot take: embeddings are overrated. They overfit on word overlap, producing both many false positives and many false negatives. If you identify a specific failure ("I really want to match items like these, but it does not work"), it is almost impossible to fix. I often see them used inappropriately by people who read about their magical properties but never really evaluated their results.
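
To make "evaluating their results" concrete: a minimal recall@k check is only a few lines. A sketch, where the model name and the toy labeled data are placeholders, not a benchmark:

    # Minimal sketch of a recall@k evaluation for an embedding model.
    # Model name and labeled data are placeholders.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = ["reset your password via account settings",
              "annual report of quarterly earnings",
              "trail map for the national park"]
    queries = ["forgot my login", "company financials", "hiking routes"]
    relevant = [0, 1, 2]  # index of the correct document for each query

    doc_emb = model.encode(corpus, normalize_embeddings=True)
    q_emb = model.encode(queries, normalize_embeddings=True)

    k = 1
    scores = q_emb @ doc_emb.T                 # cosine similarity (rows are normalized)
    topk = np.argsort(-scores, axis=1)[:, :k]  # k best documents per query
    recall = np.mean([relevant[i] in topk[i] for i in range(len(queries))])
    print(f"recall@{k}: {recall:.2f}")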


I think there is a deeper technical truth here that hints at how much headroom is left for optimization:

1) Matryoshka representations work remarkably well: as few as 64 dimensions account for a large majority of the performance.

2) Dimensional collapse is easy to observe. Look at your cosine similarity scores and be amazed: despite the scale running from -1 to 1, almost nothing ever scores below 0.8 for most models. (A sketch for checking both points follows.)
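
Both points are easy to check yourself. A minimal sketch; the model and sentences are my own illustrative choices, and MiniLM was not trained with a Matryoshka loss, so the 64-dim truncation only approximates point 1:

    # Minimal sketch for checking both points above.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

    sentences = ["How do I reset my password?",
                 "Steps to recover a forgotten login credential",
                 "Best hiking trails near Denver",
                 "The 2008 financial crisis explained"]
    emb = model.encode(sentences)

    def cosine_matrix(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # renormalize rows
        return x @ x.T

    # Point 2: see how narrow the band of off-diagonal similarities is.
    print(np.round(cosine_matrix(emb), 2))

    # Point 1: keep only the first 64 dimensions and compare.
    print(np.round(cosine_matrix(emb[:, :64]), 2))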

I think this technology is still in its infancy, even with all of the advances in recent years.


"I really want to match items like these, but it does not work" is just a fine tuning problem.


Yes, in the sense that it works if you have an unlimited appropriate dataset and compute. No, in the sense of what is practically achievable.


You don't need infinite data. You need ~100k samples. It's also not particularly expensive.
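
For concreteness, a minimal sketch of what fine-tuning on positive pairs looks like with sentence-transformers; the model name, the two toy pairs, and the hyperparameters are placeholders for a real ~100k-pair run:

    # Minimal sketch: fine-tune an embedding model on positive pairs with an
    # in-batch-negatives contrastive loss. All names and values are placeholders.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each example is a (query, relevant item) pair; the other items in the
    # batch serve as negatives under MultipleNegativesRankingLoss.
    pairs = [
        InputExample(texts=["forgot my login",
                            "reset your password via account settings"]),
        InputExample(texts=["company financials",
                            "annual report of quarterly earnings"]),
    ]
    loader = DataLoader(pairs, shuffle=True, batch_size=2)
    loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
    model.save("finetuned-model")  # placeholder output path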


Dude, you are talking total nonsense. What does "100k samples" even mean? We have not even established what task we are talking about, and you already know how many samples are needed. Not to be offensive, but you seem like the type of guy who believes in these magical properties.


… you established the task earlier? Item X and Item Y should be colocated in embedding space.

This is what people using embedding models for recsys are doing. It’s not rocket science and it doesn’t require “infinite data”.

By 100k samples I mean 100k samples that provide relevance feedback: 100k positive pairs.

I’m working on these kinds of problems with actual customers. Not really sure where the hostility is coming from.


You can easily fix this using embedding arithmetic to build embedding classifiers.
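
One way to read this (my interpretation; the parent may mean something else): average the embeddings of labeled positives and negatives, take the difference as a decision direction, and classify by dot product against the midpoint. A sketch with a placeholder model and tiny placeholder labeled sets:

    # Minimal sketch of one reading of "embedding arithmetic classifier":
    # a centroid-difference direction in embedding space.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    positives = ["great product, works perfectly", "love it, highly recommend"]
    negatives = ["broke after a week", "terrible support, want a refund"]

    pos = model.encode(positives, normalize_embeddings=True).mean(axis=0)
    neg = model.encode(negatives, normalize_embeddings=True).mean(axis=0)
    direction = pos - neg                    # decision direction via arithmetic
    threshold = (pos + neg) @ direction / 2  # score at the centroid midpoint

    def classify(text):
        e = model.encode([text], normalize_embeddings=True)[0]
        return "positive" if e @ direction > threshold else "negative"

    print(classify("works great, very happy"))  # expected: positive

Mathematically this is just a nearest-centroid classifier written as vector arithmetic; training a logistic regression on the raw embeddings is the more robust version of the same idea.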


Are there good examples of this working in the wild? Before I comb through all ten blue links... https://www.google.com/search?q=embedding%20arithmetic%20emb...



