Efficient data labelling with weak supervision
Maria Mestre
Labelling data is a tedious and expensive process, and it is often the point of failure in machine learning projects. If the data has not been labelled correctly, or if the annotation taxonomy and definitions change, every downstream task of the ML project is affected. Data exploration, labelling and model training are tightly coupled and should be approached as a single iterative process. In this talk, we will show how to use DataQA, an open-source Python platform and library for text exploration and labelling. DataQA offers functionality to apply weak supervision techniques that automatically label large corpora of documents for tasks such as classification or named entity recognition. These techniques also provide safeguards when the label definitions change. We will show how to apply them successfully to an e-commerce classification task and a health entity extraction task.
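To make the idea of weak supervision concrete, here is a minimal sketch in plain Python. It does not use the DataQA API: the labelling functions, keyword lists and the `weak_label` helper are all hypothetical, chosen only to illustrate the general technique of combining several noisy rules by majority vote on an e-commerce style classification task.

```python
from collections import Counter

# Sentinel returned when a rule has no opinion about a document.
ABSTAIN = None


def lf_electronics(text):
    """Hypothetical rule: vote 'electronics' on typical device keywords."""
    keywords = ("laptop", "headphones", "charger")
    return "electronics" if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_clothing(text):
    """Hypothetical rule: vote 'clothing' on typical garment keywords."""
    keywords = ("shirt", "jeans", "jacket")
    return "clothing" if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_usb(text):
    """Hypothetical rule: any mention of USB suggests 'electronics'."""
    return "electronics" if "usb" in text.lower() else ABSTAIN


LABELLING_FUNCTIONS = [lf_electronics, lf_clothing, lf_usb]


def weak_label(text, labelling_functions=LABELLING_FUNCTIONS):
    """Combine the rules' votes by simple majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in labelling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    # A tie between the top two labels is treated as "no confident label".
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


if __name__ == "__main__":
    for title in ["USB-C laptop charger", "Denim jacket", "Garden hose"]:
        print(title, "->", weak_label(title))
```

Because the labels come from rules rather than from manual annotation, a change in the label definitions can be handled by editing the affected rules and re-running them over the corpus. Majority vote is the simplest way to combine the rules; it stands in here for whatever aggregation method a real tool uses.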
Maria Mestre
Affiliation: DataQA
After completing a PhD in signal processing and machine learning at Cambridge University, Maria worked at several companies, building ML solutions for problems in domains such as healthcare, finance and adtech. She is now CEO and co-founder of DataQA, a no-code platform for extracting information from text using advanced NLP techniques.