
History of data science

Since the 1960s, the focus of mathematicians and statisticians has been shifting toward data analysis. The term "data science" is used for the first time in 1974 by Peter Naur in his "Concise Survey of Computer Methods", where he defines it as the "science of dealing with data". In 1976, John Chambers at Bell Labs creates the programming language S. This lays the basis for statistical computing and for quantitative programming environments (QPEs) that use scripts and workflows. In the 1990s, S inspires the creation of an open source language called R, which has been for many years the main programming language for data analysts and data scientists. In 1997, Professor C. F. Jeff Wu calls for statistics to be renamed data science and for statisticians to be renamed data scientists. The same year, the journal Data Mining and Knowledge Discovery is launched. In 2001, William S. Cleveland publishes "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". To Cleveland, a data analyst is good at programming but has limited knowledge of statistics. A data scientist, on the other hand, comes from a statistics background but has to work more closely with computer specialists. By the start of the 2010s, researchers and writers attempt to explain data science to the public, and data scientist is claimed to be the sexiest job of the 21st century.

Changing the trend

There have been articles and internet posts about a cool-down era for data science, but none of those I have seen provides convincing numbers or statistics to support the claim. However, we know from our own work experience that companies hiring data scientists are shifting toward engineering (ML engineering) and DevOps (MLOps) tasks, and that the expectations in job descriptions are changing. I have also looked at Google Trends, and the graph below shows a decline in Google search interest in "Data Science" since 2020.

[Figure not available: Google Trends search interest in "Data Science"]

What is Data Science and why do we need data scientists?

As we already know, data science has its roots in mathematics and statistics. I once heard it described as a child of statistics and computer science, and I find that description accurate. However, the job description of a data scientist has always been vague. Suppose we describe a full pipeline like this:

Raw Data > Data Wrangling > Data Cleansing > Data Preparation > Model Learning & Validation > Model Deployment > Visualization

This pipeline requires a full set of knowledge in:

  • Programming
  • Statistics & Machine Learning
  • Field knowledge
  • Business knowledge
  • Visualization skills
  • And maybe cloud computation

In real life, it is almost impossible to find someone with all these skills. That might be the main reason behind the initial confusion about data science job descriptions: different companies and organizations define the data scientist's tasks as whichever part of the pipeline they need support with. Looking at how successful organizations work and how product teams are built, one can suggest that the pipeline above needs engineers and data analysts as well as data scientists. So how should we really define data science? Do we really need data scientists? I believe data science is not cooling down; it is finally maturing and finding its place in the wider tech product world, and that is good news. I think all of us, in our different positions, should try to define data science better. In my view, the most reasonable definition covers:

  • Translating business needs into models. That means thinking about how a business can get value from ML models and which models can be used. This needs strong field knowledge as well as business knowledge.
  • Exploratory analysis, statistics and extracting information from data, including visualization.
  • Basic knowledge of model deployment. Data scientists might not deploy models themselves if ML engineers are available, but they should know how models get deployed, because otherwise they might develop models that are very slow or useless once deployed.
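
The last point can be made concrete with a quick latency check before a model is handed over for deployment. The following is only a minimal sketch: the `predict` callable, the helper names and the 50 ms budget are all assumptions chosen for illustration, not values from any particular project.

```python
import statistics
import time


def measure_latency_ms(predict, sample, runs=100):
    """Return (median, 95th percentile) latency of predict(sample) in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    # Nearest-rank estimate of the 95th percentile.
    p95_index = max(0, int(round(0.95 * len(timings))) - 1)
    return statistics.median(timings), timings[p95_index]


def fits_budget(predict, sample, budget_ms=50.0, runs=100):
    """True if the 95th-percentile latency stays within the assumed budget."""
    _, p95 = measure_latency_ms(predict, sample, runs)
    return p95 <= budget_ms


# Hypothetical stand-in for a trained model's predict function.
def toy_predict(features):
    return sum(features) / len(features)


if __name__ == "__main__":
    median_ms, p95_ms = measure_latency_ms(toy_predict, [1.0, 2.0, 3.0])
    print(f"median: {median_ms:.4f} ms, p95: {p95_ms:.4f} ms")
    print("within budget:", fits_budget(toy_predict, [1.0, 2.0, 3.0]))
```

A check like this does not replace the work of ML engineers, but it gives a data scientist an early signal that a model may be too slow for the way it will be served.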

Dr. Setareh Sadjadi

Affiliation: Diconium GmbH

Passionate about data, ethics and diversity. Aiming to build great and inclusive data teams and data products. Holds a PhD in Process Engineering and worked as a social worker before falling in love with data science.