Efficient data labelling with weak supervision Maria Mestre PyConDE & PyDataBerlin 2022 conference

Tuesday 13:50 in Kuppelsaal

Type/Track: Talk, PyData: Natural Language Processing

Labelling data is a tedious and expensive process, and it is often the point of failure in machine learning projects. If the data has not been labelled correctly, or if the annotation taxonomy and definitions change, every downstream task of the ML project is affected. Data exploration, labelling and model training are tightly coupled and should be approached as a single iterative process. In this talk, we will show how to use DataQA, an open-source Python platform and library for text exploration and labelling. DataQA offers functionality to apply weak supervision techniques that automatically label large corpora of documents for tasks such as classification or named entity recognition. These techniques also provide a safeguard when label definitions change. We will show how to apply them successfully to an e-commerce classification task and a health entity extraction task.
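As a rough illustration of the weak supervision idea mentioned in the abstract, the sketch below labels product titles with a few keyword rules and combines their votes by simple majority. It is plain Python and does not use the DataQA API; the rule names, keywords, categories and the `weak_label` helper are all hypothetical, chosen only for this example.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion


# Hypothetical labelling rules for an e-commerce classification task.
def lf_electronics(text):
    keywords = ("laptop", "headphones", "charger")
    return "electronics" if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_clothing(text):
    keywords = ("shirt", "jeans", "jacket")
    return "clothing" if any(k in text.lower() for k in keywords) else ABSTAIN


def lf_usb(text):
    return "electronics" if "usb" in text.lower() else ABSTAIN


LABELLING_FUNCTIONS = [lf_electronics, lf_clothing, lf_usb]


def weak_label(text, labelling_functions=LABELLING_FUNCTIONS):
    """Combine rule votes by majority; abstain when no rule fires or on a tie."""
    votes = [lf(text) for lf in labelling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return counts[0][0]


products = ["USB-C laptop charger", "Blue denim jacket", "Garden hose"]
labels = [weak_label(p) for p in products]
```

Under this setup, a change in a label definition would be handled by editing the affected rule and re-running `weak_label` over the corpus, rather than re-annotating documents by hand. Majority voting is only one simple way to combine rules; real weak supervision tools may weight or model the rules differently.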

Tags: Data Engineering, Data Visualization, Natural Language Processing

Level: Domain Expertise: none; Python Skill Level: none

Maria Mestre

Affiliation: DataQA

After completing a PhD in signal processing and machine learning at the University of Cambridge, Maria worked at several companies, building ML solutions for problems across many domains (healthcare, finance, adtech). She is now CEO and co-founder of DataQA, a no-code platform for extracting information from text using advanced NLP techniques.

Visit the speaker at: GitHub • Homepage