
My hot take: embeddings are overrated. They overfit on word overlap, producing both many false positives and many false negatives. If you identify a specific failure ("I really want to match items like these, but it does not work"), it is almost impossible to fix. I often see them used inappropriately by people who read about their magical properties but never really evaluated their results.
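
To make "evaluating their results" concrete: a minimal recall@k check is only a few lines. A sketch, where the model name and the toy labeled data are placeholders, not a benchmark:

    # Minimal sketch of a recall@k evaluation for an embedding model.
    # Model name and labeled data are placeholders.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = ["reset your password via account settings",
              "annual report of quarterly earnings",
              "trail map for the national park"]
    queries = ["forgot my login", "company financials", "hiking routes"]
    relevant = [0, 1, 2]  # index of the correct document for each query

    doc_emb = model.encode(corpus, normalize_embeddings=True)
    q_emb = model.encode(queries, normalize_embeddings=True)

    k = 1
    scores = q_emb @ doc_emb.T                 # cosine similarity (rows are normalized)
    topk = np.argsort(-scores, axis=1)[:, :k]  # k best documents per query
    recall = np.mean([relevant[i] in topk[i] for i in range(len(queries))])
    print(f"recall@{k}: {recall:.2f}")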


I think there is a deeper technical truth here that hints at how much headroom is left for optimization:

1) Matryoshka representations work remarkably well: as few as 64 dimensions account for a large majority of the performance.

2) Dimensional collapse is easy to observe. Look at your cosine similarity scores and be amazed: despite the scale running from -1 to 1, almost nothing ever scores below 0.8 for most models. (A sketch for checking both points follows.)
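
Both points are easy to check yourself. A minimal sketch; the model and sentences are my own illustrative choices, and MiniLM was not trained with a Matryoshka loss, so the 64-dim truncation only approximates point 1:

    # Minimal sketch for checking both points above.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

    sentences = ["How do I reset my password?",
                 "Steps to recover a forgotten login credential",
                 "Best hiking trails near Denver",
                 "The 2008 financial crisis explained"]
    emb = model.encode(sentences)

    def cosine_matrix(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # renormalize rows
        return x @ x.T

    # Point 2: see how narrow the band of off-diagonal similarities is.
    print(np.round(cosine_matrix(emb), 2))

    # Point 1: keep only the first 64 dimensions and compare.
    print(np.round(cosine_matrix(emb[:, :64]), 2))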

I think this technology is still in its infancy, even with all of the advances in recent years.


"I really want to match items like these, but it does not work" is just a fine tuning problem.


Yes, in the sense that it works if you have an unlimited appropriate dataset and compute. No, in the sense of what is practically achievable.


You don't need infinite data. You need ~100k samples. It's also not particularly expensive.
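
For concreteness, a minimal sketch of what fine-tuning on positive pairs looks like with sentence-transformers; the model name, the two toy pairs, and the hyperparameters are placeholders for a real ~100k-pair run:

    # Minimal sketch: fine-tune an embedding model on positive pairs with an
    # in-batch-negatives contrastive loss. All names and values are placeholders.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Each example is a (query, relevant item) pair; the other items in the
    # batch serve as negatives under MultipleNegativesRankingLoss.
    pairs = [
        InputExample(texts=["forgot my login",
                            "reset your password via account settings"]),
        InputExample(texts=["company financials",
                            "annual report of quarterly earnings"]),
    ]
    loader = DataLoader(pairs, shuffle=True, batch_size=2)
    loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
    model.save("finetuned-model")  # placeholder output path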


Dude, you are talking total nonsense. What does "100k samples" even mean? We have not even established what task we are talking about, and you already know how many samples are needed. Not to be offensive, but you seem like the type of guy who believes in these magical properties.


… you established the task earlier? Item X and Item Y should be colocated in embedding space.

This is what people using embedding models for recsys are doing. It’s not rocket science and it doesn’t require “infinite data”.

By 100k samples I mean 100k samples that provide relevance feedback: 100k positive pairs.

I’m working on these kinds of problems with actual customers. Not really sure where the hostility is coming from.


You can easily fix this using embedding arithmetic to build embedding classifiers.
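
One way to read this (my interpretation; the parent may mean something else): average the embeddings of labeled positives and negatives, take the difference as a decision direction, and classify by dot product against the midpoint. A sketch with a placeholder model and tiny placeholder labeled sets:

    # Minimal sketch of one reading of "embedding arithmetic classifier":
    # a centroid-difference direction in embedding space.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    positives = ["great product, works perfectly", "love it, highly recommend"]
    negatives = ["broke after a week", "terrible support, want a refund"]

    pos = model.encode(positives, normalize_embeddings=True).mean(axis=0)
    neg = model.encode(negatives, normalize_embeddings=True).mean(axis=0)
    direction = pos - neg                    # decision direction via arithmetic
    threshold = (pos + neg) @ direction / 2  # score at the centroid midpoint

    def classify(text):
        e = model.encode([text], normalize_embeddings=True)[0]
        return "positive" if e @ direction > threshold else "negative"

    print(classify("works great, very happy"))  # expected: positive

Mathematically this is just a nearest-centroid classifier written as vector arithmetic; training a logistic regression on the raw embeddings is the more robust version of the same idea.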


Are there good examples of this working in the wild? Before I comb through all ten blue links... https://www.google.com/search?q=embedding%20arithmetic%20emb...



