I have been tempted to try word2vec-like techniques on e-commerce shopping carts as a way to find a particular type of recommendation. I suspect the data will be too sparse, though.
Has anyone approached similar techniques on non-text corpuses?
It works…for some definition of “works”. It’s been applied to all kinds of problems—including graphs (Node2Vec) and many other cases where the input isn’t “words”—to the point that I’d consider it a weak baseline for any embedding task. In my experience it is unreasonably effective for simple problems (make a binary classifier for tweets), but the effectiveness drops quickly as the problem gets more complicated.
In your proposed use case I would bet that you will “see” the kind of similarity you’re looking for based on vector similarity, but I also expect it to largely be an illusion due to confirmation bias. It will be much harder to make that similarity actionable to solve the actual business use case. (Like 30% of the time it’ll work like magic; 60% of the time it’ll be “meh”; 10% of the time it’ll be hilariously wrong.)
I've been looking at ways to use transformer-based models on tabular data. The hope is that these models have a much better contextual understanding of words, so embeddings from them should be of better quality than plain word2vec ones.
My idea is to turn a table row into a textual description, feed it into a transformer, and get, effectively, a sentence embedding. This acts as a query embedding. Then make a few value embeddings for the target you're trying to predict, use cosine similarity to pick the closest value embedding, and feed that to the ML model as part of the feature set. It works if the categorical values in your table are entities the model might have learned about during pretraining.
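A minimal sketch of that pipeline, with a hashing-based toy stand-in for the transformer sentence embedding (the column names, labels, and `toy_embed` helper are all hypothetical; in practice you'd swap `toy_embed` for a real sentence-embedding model):

```python
import hashlib
import math

def toy_embed(text, dim=64):
    # Hypothetical stand-in for a transformer sentence embedding:
    # hash each token into one of `dim` buckets and count occurrences.
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Verbalize a table row into a textual description (assumed columns).
row = {"merchant": "Starbucks", "amount": 4.5, "city": "Seattle"}
query_text = " ".join(f"{k} {v}" for k, v in row.items())
query_vec = toy_embed(query_text)  # the "query embedding"

# One "value embedding" per candidate target label (assumed labels).
labels = {c: toy_embed(c) for c in ["coffee shop", "grocery store", "gas station"]}

# Score each label against the row and pick the closest one;
# `best` and the raw scores can be appended to the feature set.
scores = {c: cosine(query_vec, v) for c, v in labels.items()}
best = max(scores, key=scores.get)
```

With a real model the discrimination comes from pretrained knowledge (e.g. that "Starbucks" relates to "coffee shop"), which the toy hash embedding obviously cannot provide; the sketch only shows the shape of the query-vs-value-embedding comparison.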
I tried this approach and it did improve overall performance. The next step would be fine-tuning the transformer model; I want to see if I can do it without disturbing the existing weights too much. Here's the library I used to get the embeddings
I have applied it to e-commerce shopping carts and it works quite well :). The item IDs (words) viewed in sequence in a session can be thought of as a sentence.
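The data prep for that is just grouping clickstream events into per-session "sentences" of item IDs; a small sketch (the event tuples and Word2Vec parameters are illustrative assumptions):

```python
# Assumed raw data: (session_id, item_id) view events, already in time order.
events = [
    ("s1", "item_42"), ("s1", "item_7"), ("s1", "item_9"),
    ("s2", "item_7"), ("s2", "item_42"),
]

# Group events by session: each session becomes one "sentence" of item IDs.
sessions = {}
for session_id, item_id in events:
    sessions.setdefault(session_id, []).append(item_id)

sentences = list(sessions.values())

# With gensim installed, these sentences feed straight into word2vec
# (skip-gram tends to work better for this kind of sparse data):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, vector_size=64, window=5, min_count=1, sg=1)
# model.wv.most_similar("item_42")  # nearest items by embedding similarity
```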
I have applied it to the names in a population database. It learnt interesting, and expected, structure. Visualized with UMAP, it clustered by gender first, and then by something that could probably be described as the cultural origin of the name.