About the Future of the Data Scientist Profession

For some time now, while browsing the web, I have come across a number of posts and videos about the future of the Data Scientist profession. Drawing on my own practice and experience, this article tries to offer you, as objectively as possible, some first elements of an answer.

Some argue that Data Scientists themselves could eventually be replaced by increasingly capable AutoML algorithms; others that the Data Scientist profession, like the Data Engineer profession, will specialize so much that the very title of Data Scientist will be called into question. That this title is coveted to the point of being controversial is beyond doubt, and it is clear that data science practices in business are very diverse and evolving rapidly. But a central question is currently being asked: is the title of Data Scientist doomed to disappear in the long run?

First of all, it should be noted that the Data Scientist profession is constantly evolving, which is also a testament to its vitality. A few years ago it split in two with the birth of the Data Engineer role, which in turn gave rise to the more recent ML Engineer role. The ML Engineer is primarily a Data Engineer specializing in putting models into production, deployment, operations, and model tracking; to summarize in one word, MLOps. There have therefore been successive specializations since the beginning. Data Science and AI (Artificial Intelligence) techniques are becoming more widespread, thanks to the diversity of methods (supervised, unsupervised, deep, and reinforcement learning), technologies, and available languages (Python, R, Scala, Java, Julia), whether in the tools of the main cloud platforms (AWS, Azure, GCP) or in more general Data Science platforms such as KNIME, DATAIKU, ALTERYX, SAS, etc. This leads us to ask the following question: what should we expect of a good Data Scientist?

Data science is a highly hybrid discipline that requires an understanding of business needs, knowledge of data pipeline preparation, model design and testing (including math, statistics, machine learning, and AI), deployment to production, and model tracking. Of course, all this requires a great deal of technological mastery, and within this list, the preparation of data pipelines, deployment to production, and the tracking of models now fall primarily to the Data Engineer and the ML Engineer.

I believe that the central role of the Data Scientist must remain fundamentally in understanding business needs and translating those needs into working models. This requires strong skills in advanced mathematics, statistics, and machine learning. Indeed, designing artificial intelligence and machine learning models without an excellent command of statistics is very risky. It is not about copying and pasting Python code found on the Internet without stepping back, but about truly understanding how algorithms will behave and interact given variations in the input data, possible biases in the data, and data drift problems that can arise over time.

The choice of algorithms in the models remains a delicate point, because the complexity of the algorithms used must be weighed against the complexity of the input data. An algorithm that is too simple or too complex compared to the intrinsic complexity of the data will always give poor results (underfitting in the first case, overfitting in the second).
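To make this concrete, here is a minimal sketch of that mismatch, using scikit-learn and synthetic data (both are my own assumptions; the article names no specific library or dataset). A linear model underfits a quadratic signal, while a model matched to the data's complexity does much better on held-out points:

```python
# Minimal sketch: model complexity vs. intrinsic data complexity.
# Synthetic data and library choice (scikit-learn) are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 200)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.3, 200)  # quadratic signal + noise

scores = {}
for degree in (1, 2, 15):  # too simple, well matched, very complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X[::2], y[::2])                                   # train on half
    scores[degree] = r2_score(y[1::2], model.predict(X[1::2]))  # held-out R^2

# The degree-1 model underfits the quadratic signal; the degree-2 model,
# matched to the data's true complexity, captures it well.
```

Comparing `scores[1]` and `scores[2]` on the held-out half shows the gap directly; the degree-15 model illustrates the opposite risk of over-parameterizing.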

Then there is the adjustment of the models' parameters, what we call in our jargon the “hyperparameters”. It is true that more and more platforms (and libraries) offer so-called “AutoML” functionality, which automates the configuration of these parameters, sometimes even automating the choice of algorithm itself. It works in some cases, but aberrant results are still common.

AutoML is nothing more than ML (machine learning) applied to itself. It can be useful at the beginning of modeling because it saves time, but it is often insufficient later on, when the model is actually being refined. In fact, the choice of these parameters is not always neutral for operations: the Data Scientist must often be able to interpret the business consequences of choosing them.
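As a small illustration of what such automated tuning looks like in practice, here is a hedged sketch using scikit-learn's grid search on synthetic data (the library, the classifier, and the grid below are my own illustrative choices, not something the article prescribes):

```python
# Minimal sketch of automated hyperparameter tuning (the simplest cousin of
# AutoML). Classifier, grid, and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    cv=3,  # 3-fold cross-validation scores each parameter combination
)
grid.fit(X, y)

# grid.best_params_ holds the retained hyperparameters; the search picks them
# on a statistical criterion only, with no view of their business consequences.
```

The point of the sketch is the last comment: the search optimizes a cross-validation score, and it is still up to the Data Scientist to judge what the retained parameters mean for operations.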

For example, in a recommendation engine, a ranking threshold will have a direct impact on the revenue, margin, and share of the population targeted during a marketing campaign: does the business really want to ignore the impact of these adjustments on its results?
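A toy calculation makes this tangible. In the sketch below, the scores, margin, and contact cost are all invented numbers for illustration; it simply shows that moving the ranking threshold changes both how many customers are targeted and the expected profit of the campaign:

```python
# Hypothetical illustration: the business impact of a ranking threshold.
# All numbers (score distribution, margin, cost) are invented for the example.
import numpy as np

rng = np.random.default_rng(1)
scores = rng.beta(2, 5, 10_000)   # model scores for 10,000 customers
margin_per_response = 40.0        # assumed margin per converted customer
cost_per_contact = 1.0            # assumed cost of contacting one customer

def campaign_profit(threshold):
    """Expected profit if we contact every customer scored above threshold."""
    targeted = scores >= threshold
    expected_responses = scores[targeted].sum()  # score read as response proba
    return (margin_per_response * expected_responses
            - cost_per_contact * targeted.sum())

# Each threshold implies a different targeted population and a different
# expected result -- exactly the trade-off the business needs explained.
profits = {t: campaign_profit(t) for t in (0.1, 0.3, 0.5)}
```

Nothing here is sophisticated; the value is in being able to put numbers like these in front of the business when the threshold is discussed.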

The Data Scientist therefore plays a key role in the dialogue with the business teams, proposing and explaining the consequences of the important decisions that will be made during modeling. And even though the data pipelines are prepared by the Data Engineers, it is the Data Scientist who decides which data (internal and external) to use, and who frames what is called “Feature Engineering”.

Feature Engineering plays an essential role in transforming and enriching the model's input data, and I am not just talking about data quality here. A model trained on raw data is worth little; a model is all the more effective when the data has been prepared, refined (a bit like refining oil into gasoline), and enriched. One variable will be recoded so that its distribution is compatible with the type of algorithm used; one predictor will be crossed with another to create a more relevant indicator as input to the algorithm. I think this is where the impact of human reasoning on the end result is greatest, and it is above all the care taken during this phase that characterizes the competence of a Data Scientist. A simpler model with better-prepared data beats a very sophisticated model with poorly prepared data.
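The two operations mentioned above can be sketched in a few lines. The column names and values below are invented for illustration; the point is only the shape of the transformations (recoding a skewed distribution, crossing two predictors into a ratio):

```python
# Minimal sketch of the two feature engineering moves described above.
# Column names and values are invented illustrative data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 400_000],  # heavily skewed predictor
    "n_purchases": [1, 4, 2, 10],
    "n_visits": [10, 20, 10, 40],
})

# Recoding: a log transform compresses the skewed distribution so it is
# better behaved for many algorithm families.
df["log_income"] = np.log1p(df["income"])

# Crossing: two raw predictors combined into a more relevant indicator.
df["conversion_rate"] = df["n_purchases"] / df["n_visits"]
```

Each line is trivial in isolation; the craft lies in knowing which variables deserve this treatment for a given algorithm and business question.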

This requires not only algorithmic skills, but also well-mastered statistical and mathematical skills. One can see it this way: Big Data is just a very large data space, which has to become what mathematicians call a “vector space” in order to be processed properly by algorithms. It is essential that Data Scientists understand how to carry out this transformation and be able to perform intelligent mathematical and statistical operations on this vector space to add business value.
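The simplest concrete instance of that transformation is turning a categorical variable into coordinates. Here is a minimal sketch with pandas (the column and values are invented); after encoding, each record is literally a point in a small vector space:

```python
# Minimal sketch: from raw records to the "vector space" algorithms expect.
# One-hot encoding maps each categorical record to a point in R^3 here.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
vectors = pd.get_dummies(df["color"])  # 4 records -> 4 points in R^3
```

Real pipelines combine many such encodings, scalings, and embeddings, but the principle is the same: the model never sees "red", only coordinates.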

You may be surprised if I tell you that this phase represents most of a Data Scientist's work today, compared to the time spent developing and testing the model itself. It is the key step that makes the difference between a good model and a bad one.

This phase cannot be automated “a priori”: it must first have been done entirely by hand on a new dataset before it can later be automated. There are several reasons for this, the first being that this phase requires an inductive, statistical approach made of data exploration, hypotheses, and statistical testing. Although recoding and feature engineering methods exist, this phase is so variable and so data-dependent that no automated AI platform has yet provided a truly satisfactory answer at this stage of the process. For example, I have seen the accuracy of the same algorithm multiplied by a factor of 2 simply by recoding a single variable among several hundred, without changing a single comma in the model.
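The explore-hypothesize-test loop mentioned above can be sketched in its smallest form. The samples below are invented for illustration (SciPy is my assumed tool; the article names none): we suspect a variable separates converters from non-converters, and a statistical test tells us whether it is worth keeping or recoding:

```python
# Minimal sketch of the inductive loop: explore a variable, form a hypothesis,
# test it statistically. Samples and library (SciPy) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
spend_converted = rng.normal(55, 10, 200)      # invented spend of converters
spend_not_converted = rng.normal(50, 10, 200)  # invented spend of the rest

# Hypothesis: mean spend differs between the two groups, so the variable
# carries signal and deserves a place (and perhaps a recoding) in the model.
t_stat, p_value = stats.ttest_ind(spend_converted, spend_not_converted)
significant = bool(p_value < 0.05)
```

An automated platform can run thousands of such tests, but deciding which hypotheses are worth testing, and what a significant result means for the business, remains manual work.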

The professional Data Scientist, thanks to a scientific approach and intuition, will know which variables to recode to get the best results; today's best AI is unable to do this alone.

The role of the Data Scientist (with the tools and technologies we have today) is therefore completely indispensable, not only for the choice of data, its preparation, feature engineering, modeling, and testing, but above all for making the link between what is done mathematically and the business.

This therefore requires a dual competence in mathematics/statistics and in communication: you need to be good at communication as well as at math and statistics, in order to explain what is going on and to offer the business the right options.

Data science is also a team effort: business teams, Data Scientists, Data Engineers, and operations need to work together. If one link in the chain is missing, the model will never go into production, and then we can say goodbye to ROI.

Another argument (if one were needed): until now, the main outlets for high-level applied mathematics have been the physical sciences, the engineering sciences, and finance. Today, data science is a whole new possible path, even if I think the link with mathematics education is still too rarely made.

In conclusion, I am willing to speculate that in the future, beyond its day-to-day business practice, data science will lead to unexpected discoveries in the world of mathematics, perhaps even in theoretical mathematics, where I believe there is still room for important new discoveries.

Therefore, I wish a long life to the Data Scientist profession!