idealo.de offers a price comparison service on millions of products from a wide range of categories. Each day we receive millions of offers that we cannot map to our product catalogue. We started clustering these offers into new product clusters, with the ultimate aim of enhancing our product catalogue. For this we mainly use two open-source libraries:
- Sentence-Transformers to encode the offers into a vector space
- Facebook Faiss to run k-nearest-neighbour search in that vector space
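The second step of this pipeline, nearest-neighbour search over offer embeddings, can be illustrated with a minimal brute-force sketch in plain Python. It is not the Faiss API and not our production code: the function names `normalize` and `knn` and the three-dimensional toy vectors are assumptions made up for illustration. In practice the vectors would come from a Sentence-Transformers model, and Faiss would replace the exhaustive loop with an index that scales to millions of offers.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def knn(query, vectors, k):
    """Return the k most similar vectors as (index, cosine similarity) pairs.

    Brute force: the query is compared against every stored vector.
    """
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = normalize(vec)
        similarity = sum(a * b for a, b in zip(q, v))
        scored.append((similarity, idx))
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, similarity) for similarity, idx in scored[:k]]


# Hypothetical toy embeddings: offers 0 and 1 are close, as are offers 2 and 3.
offer_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 1.0],
]

neighbours = knn(offer_vectors[0], offer_vectors, k=2)
print(neighbours)
```

Offers whose nearest neighbours largely overlap are candidates for the same product cluster. How neighbour lists are turned into clusters is a separate design decision that this sketch leaves open.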
We will present our results for various optimisation strategies to fine-tune Transformers for our clustering use case. The strategies include siamese and triplet network architectures, as well as an approach with an additive angular margin loss. Results will also be compared against a probabilistic record linkage and TF-IDF approach.
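To make two of these objectives concrete, here is a small plain-Python sketch of a triplet loss and of a softmax loss with an additive angular margin. It only illustrates the general form of the two losses. It is not the training code behind our results, and the function names, the `class_centres` argument and the default values for `margin` and `scale` are assumptions chosen for the example. The angular-margin variant is also simplified: it does not handle the case where the angle plus the margin exceeds pi.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def cosine(a, b):
    """Cosine similarity (assumes neither vector is the zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_margin_loss(embedding, class_centres, target, scale=30.0, margin=0.5):
    """Softmax cross-entropy where the target class angle is penalised by `margin`."""
    logits = []
    for idx, centre in enumerate(class_centres):
        # Clamp to [-1, 1] to guard acos against floating-point rounding.
        cos_theta = max(-1.0, min(1.0, cosine(embedding, centre)))
        theta = math.acos(cos_theta)
        if idx == target:
            theta += margin  # additive angular margin on the true class only
        logits.append(scale * math.cos(theta))
    # Numerically stable log-sum-exp for the cross-entropy.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[target]


# Hypothetical two-dimensional examples.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative is far away
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])  # negative is closer than positive
print(easy, hard)

centres = [[1.0, 0.0], [0.0, 1.0]]
print(angular_margin_loss([1.0, 1.0], centres, target=0, margin=0.0))
print(angular_margin_loss([1.0, 1.0], centres, target=0, margin=0.5))
```

The triplet loss only looks at relative distances within one triplet, whereas the angular-margin loss compares an embedding against all class centres at once and forces the true class to win by a fixed angle.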
Further, we will share our lessons learned, e.g. how both libraries make a Machine Learning Engineer's life fairly easy and how we created informative training data for our best-performing solution.
Sebastian is a Senior Machine Learning Engineer at idealo.de where he works mainly on NLP and Computer Vision problems to improve the product catalogue. In previous positions he applied machine learning methods to telemetry and tracking data. Sebastian holds a Master’s degree in Business Informatics from Mannheim University.
Affiliation: idealo Internet GmbH
Christopher is the Tech Lead for the Machine Learning Engineering team at idealo.de, where he consults on various ML projects and helps build ML platform products. In previous positions he applied ML methods to fMRI as well as financial data. Christopher holds a Master’s degree in Statistics from Humboldt University Berlin.