Transformer based clustering: Identifying product clusters for E-commerce

Sebastian Wanner, Christopher Lennan

Wednesday 14:40 in A1

Type/Track: Talk, PyData: Natural Language Processing

idealo.de offers a price comparison service on millions of products from a wide range of categories. Each day we receive millions of offers that we cannot map to our product catalogue. We started clustering these offers to create new product clusters to ultimately enhance our product catalogue. For this we mainly use two open-source libraries:

Sentence-Transformers to encode the offers into a vector space
Facebook Faiss to do K-Nearest-Neighbours search in vector space
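
The encode-then-search pipeline can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the speakers' implementation: `encode` is a stand-in based on character trigrams where a Sentence-Transformers model would normally produce dense embeddings, `knn` is a brute-force inner-product search standing in for a Faiss index, and the `threshold`, `k`, and sample offers are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def encode(text):
    """Stand-in encoder (an assumption, not a Transformer):
    L2-normalised character-trigram counts stored as a sparse dict."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: c / norm for gram, c in counts.items()}


def inner_product(u, v):
    """Inner product of two sparse vectors; equals cosine similarity here
    because both vectors are L2-normalised."""
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


def knn(vectors, k):
    """Brute-force search: for each vector, its k most similar other vectors
    as (score, index) pairs, best first."""
    results = []
    for i, query in enumerate(vectors):
        scored = [(inner_product(query, v), j)
                  for j, v in enumerate(vectors) if j != i]
        scored.sort(reverse=True)
        results.append(scored[:k])
    return results


def cluster(offers, k=2, threshold=0.5):
    """Link each offer to neighbours scoring at least `threshold`, then
    return the connected components as clusters (union-find)."""
    vectors = [encode(offer) for offer in offers]
    parent = list(range(len(offers)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, neighbours in enumerate(knn(vectors, k)):
        for score, j in neighbours:
            if score >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, offer in enumerate(offers):
        groups[find(i)].append(offer)
    return list(groups.values())


# Hypothetical offer titles, made up for illustration.
offers = [
    "Apple iPhone 12 64GB black",
    "Apple iPhone 12 64 GB black smartphone",
    "Bosch cordless drill 18V",
    "Bosch 18V cordless drill set",
]
clusters = cluster(offers)
```

With real embeddings, the similarity threshold and the number of neighbours would need tuning on labelled data; the values above only suit this toy input.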

We will present our results for various optimisation strategies to fine-tune Transformers for our clustering use case. The strategies include siamese and triplet network architectures, as well as an approach with an additive angular margin loss. Results will also be compared against a probabilistic record linkage approach and a TF-IDF approach.
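
To make two of the objectives named above more concrete, here is a small sketch of a triplet loss and an additive angular margin loss on plain Python lists. It is an illustration under assumptions, not the setup used in the talk: the cosine-distance formulation, the `margin` and `scale` defaults, and the function names are all chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on cosine distance: the positive should be closer to the
    anchor than the negative by at least `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def angular_margin_loss(embedding, class_centres, target, margin=0.3, scale=16.0):
    """Softmax cross-entropy over scaled cosines, with `margin` added to the
    angle between the embedding and its target class centre.
    Simplification: no special handling when angle + margin exceeds pi."""
    logits = []
    for idx, centre in enumerate(class_centres):
        cos_theta = max(-1.0, min(1.0, cosine(embedding, centre)))
        if idx == target:
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]


# Toy 2-D vectors, made up for illustration.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])   # negative is far away
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])   # negative is closer
with_margin = angular_margin_loss([1.0, 0.2], [[1.0, 0.0], [0.0, 1.0]], target=0)
no_margin = angular_margin_loss([1.0, 0.2], [[1.0, 0.0], [0.0, 1.0]], target=0, margin=0.0)
```

The margin makes the target class harder to satisfy, so the same embedding incurs a larger loss with the margin than without it, which pushes embeddings of the same cluster closer together during training.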

Further, we will share our lessons learned, e.g. how both libraries make a Machine Learning Engineer's life fairly easy and how we created informative training data for our best-performing solution.

Tags: Natural Language Processing, Neural Networks / Deep Learning, Use Case

Level: Domain Expertise: some, Python Skill Level: some

Sebastian Wanner

Affiliation: idealo.de

Sebastian is a Senior Machine Learning Engineer at idealo.de where he works mainly on NLP and Computer Vision problems to improve the product catalogue. In previous positions he applied machine learning methods to telemetry and tracking data. Sebastian holds a Master’s degree in Business Informatics from Mannheim University.

Christopher Lennan

Affiliation: idealo Internet GmbH

Christopher is the Tech Lead for the Machine Learning Engineering team at idealo.de where he consults on various ML projects and helps build ML platform products. In previous positions he applied ML methods to fMRI as well as financial data. Christopher holds a Master’s degree in statistics from Humboldt University Berlin.

Visit the speaker at: GitHub