
There are so many data science tools today that navigating them can be difficult. Many are AI platforms that “do everything by clicking on a UI” and do not leverage pre-existing tools, e.g., Git for versioning, or a good old Python IDE instead of Jupyter Notebooks. On the other hand, ML engineering is not classical software engineering:

  • in addition to the code, the data should also be versioned;
  • in essence, ML engineering is exploratory work: one cannot know whether a model is going to work before testing it;
  • there is no clear way to guarantee the quality of a trained model: the data scientist has to play with it to make it “talk”.

In this talk, we will build a complete, fully customizable system in Python to track Machine Learning experiments. For the purpose of this talk, we will train a neural network (TensorFlow) to classify images as cat or dog, though the main focus is on the tooling and not the ML algorithm. We will use:

  • DVC (Data Version Control) to 1) version the data alongside the code with Git, 2) build training pipelines to orchestrate the Python scripts, and 3) version experiments;
  • Streamlit to build data exploration apps to play with the trained models.

Both DVC and Streamlit are open-source libraries with Python APIs. In the second part of the talk, we will focus on various ways of combining DVC and Streamlit. For instance, we will see how to build a Streamlit app that lets you select any trained model tracked with DVC (given its Git commit), load it, and test it on given input images.
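To give a rough idea of what such an app could look like, here is a minimal sketch. It is not the code from the talk. It assumes a hypothetical DVC-tracked model file `model/model.h5`, a 128×128 input size, pixel values scaled to [0, 1], and a single sigmoid output where 0 means cat and 1 means dog. The helper names (`label_from_score`, `load_model_at_rev`, `render_app`) are made up for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical values: adjust them to your own repository and model.
MODEL_PATH = "model/model.h5"  # DVC-tracked model file (assumed name)
IMAGE_SIZE = (128, 128)        # input size used at training time (assumed)
CLASS_NAMES = ("cat", "dog")   # assumes one sigmoid output: 0 = cat, 1 = dog


def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a class name."""
    return CLASS_NAMES[int(score >= threshold)]


def load_model_at_rev(rev, repo="."):
    """Load the model file as it was tracked at Git revision `rev`."""
    import dvc.api
    import tensorflow as tf

    # dvc.api.read returns the content of a tracked file at a given revision.
    data = dvc.api.read(MODEL_PATH, repo=repo, rev=rev, mode="rb")
    # Keras loads models from a path, so write the bytes to a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        local_path = Path(tmp) / Path(MODEL_PATH).name
        local_path.write_bytes(data)
        return tf.keras.models.load_model(str(local_path))


def render_app():
    """Draw the Streamlit page: pick a revision, upload an image, predict."""
    import numpy as np
    import streamlit as st
    from PIL import Image

    st.title("Cat vs. dog: test a trained model")
    rev = st.text_input("Git commit (or branch/tag) of the model to load")
    uploaded = st.file_uploader("Input image", type=["png", "jpg", "jpeg"])
    if not rev or uploaded is None:
        return  # wait until both inputs are provided

    model = load_model_at_rev(rev)
    image = Image.open(uploaded).convert("RGB").resize(IMAGE_SIZE)
    # Add a batch dimension and scale pixels to [0, 1] (assumed preprocessing).
    batch = np.asarray(image, dtype="float32")[None, ...] / 255.0
    score = float(model.predict(batch)[0][0])
    st.image(image, caption=f"{label_from_score(score)} (score: {score:.2f})")


# Uncomment when saving this as app.py and launching `streamlit run app.py`:
# render_app()
```

The third-party imports live inside the functions so that the pure helper `label_from_score` can be tested without DVC, TensorFlow, or Streamlit installed. A real app would probably also cache the loaded model between reruns.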

During the talk, I will provide actionable code samples and live demos.

Antoine Toubhans

Affiliation: Sicara

A Python developer and data scientist, I have been Head of Science at Sicara since 2018.

I am also the organizer of the [Paris Computer Vision Meetup](https://www.meetup.com/Meetup-Computer-Vision-Paris/).

Visit the speaker at: GitHub