Do I need to be Dr. Frankenstein to create real-ish synthetic data?
Gatha
“The best way to find happiness is not to search for it but create it.”
As data scientists and statisticians, part of our happiness lies in the availability of datasets that suit our requirements. Don’t we count ourselves lucky upon finding a freely available dataset suited to the problem at hand? Or do a mental dance when said dataset is clean, balanced, and comes with non-cryptic metadata? But such instances are rare, which is why acceptance of synthetic datasets is gaining steam.
Synthetic datasets, also known as fake or proxy data, have been around for a long time but have recently received much attention thanks to several advantages. First, privacy guarantees hold stronger when fake data is shared instead of the real thing: sophisticated reverse-engineering algorithms mean that traditional perturbation or suppression methods must work ever harder to protect data subjects, and they demand computational resources that are not readily available to all. Second, the real data used to train a model can be biased against certain demographics; with growing concerns about bias, synthetic data holds the key to developing more trustworthy models. Finally, the reluctance of a data owner or the novelty of a situation may mean that certain data is not available at all. Something as novel as SARS-CoV-2 required datasets to accelerate AI-driven medical research, but major hurdles to data collection included the newness of the affliction and the hesitance of patients.
Recent history has seen many instances where synthetic datasets have saved the day. I propose to present the need for synthetic data, discuss various aspects of its application, the metrics to measure its realness, and how to make some of your own without harnessing the power of lightning. My talk will focus on:
Relevance of synthetic datasets in data science. (minutes 1-2; 2 minutes)
Different types of fake data and their applications to use cases. (minutes 3-6; 4 minutes)
How to measure the realness of synthetic data, as sketched after this outline. (minutes 7-11; 5 minutes)
Some Python libraries and methods to generate synthetic data. (minutes 12-19; 8 minutes)
A simple, short piece of Python code to generate synthetic data, previewed below. (minutes 20-27; 8 minutes)
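To give a taste of that final segment, here is a minimal sketch of generating a small synthetic tabular dataset. It assumes the scikit-learn, Faker, and pandas libraries; the column names and parameters are illustrative choices, not the exact code from the talk.

    # A minimal sketch: numeric features via scikit-learn, fake identifiers via Faker.
    # Assumes scikit-learn, Faker, and pandas are installed; column names and
    # parameters here are illustrative, not prescriptions from the talk.
    import pandas as pd
    from faker import Faker
    from sklearn.datasets import make_classification

    fake = Faker()

    # Draw 100 samples with 4 informative numeric features and a binary label.
    X, y = make_classification(n_samples=100, n_features=4, n_informative=4,
                               n_redundant=0, random_state=42)

    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(4)])
    df["label"] = y
    # Attach synthetic personal fields so no real data subject is exposed.
    df["name"] = [fake.name() for _ in range(len(df))]
    df["city"] = [fake.city() for _ in range(len(df))]
    print(df.head())

In practice the numeric features would be fitted to the real data's distributions; make_classification simply provides a convenient stand-in here.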
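And as a preview of the realness discussion, one common check is the two-sample Kolmogorov-Smirnov test, which asks whether a synthetic column is statistically distinguishable from its real counterpart. The arrays below are placeholders assumed purely for illustration.

    # A sketch of one possible realness metric: the two-sample
    # Kolmogorov-Smirnov test from scipy. The arrays below are placeholders;
    # in practice you would pass a real column and its synthetic counterpart.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    real_ages = rng.normal(loc=40, scale=12, size=500)       # stand-in "real" column
    synthetic_ages = rng.normal(loc=41, scale=13, size=500)  # stand-in synthetic column

    statistic, p_value = ks_2samp(real_ages, synthetic_ages)
    # A small KS statistic (and large p-value) means the two distributions are
    # hard to tell apart, i.e., the synthetic column looks statistically "real".
    print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3f}")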
The intended audience is broad, since anyone who practices data science is on an eternal search for suitable datasets. Moreover, privacy guarantees are the need of the hour, and synthetic datasets are becoming a popular means of achieving them. A talk that simplifies this concept will encourage the audience to rethink their data acquisition strategies and stimulate them to try creating fake data suited to their own domains.
Gatha
Affiliation: Amity University, Noida
A researcher passionate about data ethics. I love to simplify complex research and bring it to everyone.
Visit the speaker at: GitHub