When data scientists build data products, they usually need to combine multiple data sources to train their models and then serve predictions. Making sure that both the code and the data behave as expected throughout the full lifetime of the project is complex. To ensure the quality of the code, automated testing is an established best practice in software engineering, and it is supported by a large body of material. However, ensuring the quality of the data going into and coming out of a data product, end to end, is not yet as well covered.
In this talk, I will explain the concept of data unit tests and why they are important. Then I will present an overview of the current libraries that help build data unit tests. Finally, I will explain how we integrated them into our workflow at GetYourGuide.
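To make the idea concrete, here is a minimal sketch of what a data unit test can look like: a function that checks expectations about a batch of records and reports every violation. It uses plain Python only and is not the approach or any of the libraries covered in the talk; the function name `check_activities` and the fields `activity_id` and `price` are hypothetical, chosen purely for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def check_activities(rows: List[Row]) -> List[str]:
    """Return a list of readable violations; an empty list means the data passed.

    Hypothetical expectations, for illustration only:
    - every row has an activity_id, and ids are unique
    - price is a non-negative number
    """
    errors: List[str] = []
    seen_ids = set()
    for i, row in enumerate(rows):
        activity_id = row.get("activity_id")
        if activity_id is None:
            errors.append(f"row {i}: activity_id is missing")
        elif activity_id in seen_ids:
            errors.append(f"row {i}: duplicate activity_id {activity_id}")
        else:
            seen_ids.add(activity_id)

        price = row.get("price")
        if not isinstance(price, (int, float)) or price < 0:
            errors.append(f"row {i}: price must be a non-negative number, got {price!r}")
    return errors


if __name__ == "__main__":
    batch = [
        {"activity_id": 1, "price": 25.0},
        {"activity_id": 1, "price": -3},  # duplicate id and negative price
    ]
    for problem in check_activities(batch):
        print(problem)
```

A check like this could run before training and before serving, so that a broken input is caught as a failed test rather than as a silently degraded model.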
Theodore Meynard is a data scientist at GetYourGuide. He works on our recommender system, which helps customers find the best activities to book and locations to explore. Before GetYourGuide, he built the recommendation system at plista to help online newspapers monetize their content. When he is not programming, he loves to ride his bike looking for the best bakery-patisserie in town.