Automation is a key element in the digital transformation of business operations. One of the engines of this automation is data science. The problem: there are still many misconceptions about artificial intelligence, algorithms, and machine learning.
To help clear up some of them and get more out of your data science, here are four good practices that should help your (data) projects succeed.
1. Understand business requirements
One of the most common misconceptions about data scientists is that all they do is compile data, run models, and produce results (outputs or insights). Of course, they do all of that. But the most important part of the job is, first of all, establishing and understanding the use case for a specific model.
In other words, what business problem needs to be addressed?
For data scientists, this process amounts to turning an operational goal into a mathematical problem. But to get there, they need to fully understand the crux of the business problem (the famous "pain points"). From that understanding flow the data sets to be used and the models to be applied to them.
But data scientists can only understand this business problem by fully understanding the market in which the company operates. They therefore need to work closely with business teams, such as product managers, to understand very precisely how the customer perceives the problem.
2. Communicate effectively
Communicating with a business team is a good practice that seems obvious, but it’s not always that easy to do in a data science project.
Data scientists tend to have more technical training than product managers, so communicating complex mathematical solutions effectively, that is, in a way that can be understood and relayed to end customers, remains a challenge.
You can’t display a set of formulas and say, “These meet your requirements, so here we go.”
Gaining a good understanding of how a model can respond to a business problem is a soft skill that data scientists need to develop (some even call it data science storytelling). In return, the business team can help by asking good questions that allow data experts to better identify the right data sets for their models.
"We need an efficient way to do X" is a simplistic but typical starting point for any project. And of course, "X" is rarely clearly defined. This is where data scientists need to work with the business to remove ambiguities and refine the use case.
Never underestimate the power of “Why?” Sometimes a customer’s initial request does not address their actual underlying problem.
A data scientist may also not have the data sets needed to produce the best model. It may then be necessary to offer an alternative, feasible answer, in which case it is essential to adjust the goal as closely as possible. But again, this requires effective communication with the business teams so that technical limitations can be conveyed to the client as early as possible.
3. Avoid "garbage in, garbage out"
Data scientists face many contingencies when obtaining the necessary inputs for their models: securing permissions to access certain data sets, regulatory constraints on sensitive data, or the disparity of data locations and formats.
Once they have this raw material gathered in one place, data scientists manipulate the data and identify the relevant features that will be used to feed the models.
This process of cleaning data, finding anomalies and missing values, and combining data sets can take up to 90% of a data scientist's time. Often, the tools and algorithms needed for a particular use case already exist in open source Python libraries such as TensorFlow or PyTorch. But not all of them do. That's why feature engineering, data verification and auditing (due diligence), and data-preparation work are the most time-consuming parts of the job.
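The preparation steps described above can be sketched in plain Python. This is a minimal illustration, not a recipe from the article: the field names, values, and plausibility threshold are all hypothetical, and real projects would typically use a library such as pandas for this work.

```python
# A minimal sketch of typical data-preparation steps: imputing missing
# values, filtering anomalies, and deriving a simple feature.
# The fields ("age", "income"), the rows, and the income cap are
# hypothetical examples, not values from the article.
from statistics import median

raw_rows = [
    {"age": 34, "income": 52_000},
    {"age": None, "income": 61_000},   # missing value
    {"age": 29, "income": None},       # missing value
    {"age": 41, "income": 9_999_999},  # implausible outlier
]

# 1. Impute missing ages with the median of the observed values.
ages = [r["age"] for r in raw_rows if r["age"] is not None]
age_median = median(ages)

# 2. Drop rows whose income fails a simple plausibility check.
INCOME_CAP = 1_000_000

clean_rows = []
for r in raw_rows:
    if r["income"] is not None and r["income"] > INCOME_CAP:
        continue  # discard the anomaly
    age = r["age"] if r["age"] is not None else age_median
    clean_rows.append({
        "age": age,
        "income": r["income"],
        # 3. Feature engineering: a derived flag a model could use,
        # guided by (hypothetical) knowledge of the business problem.
        "is_senior": age >= 40,
    })

print(len(clean_rows))  # the outlier row has been removed
```

Even in this toy version, every choice (the imputation strategy, the plausibility cap, the derived flag) encodes an assumption about the business problem, which is why step 1 above comes first.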
Keep in mind that the feature engineering process is, of course, guided by knowledge of the business problem. That is why the first step, understanding business needs, must be carried out from the very beginning of a data science project.
The quality of the data that data scientists feed into an algorithm ultimately determines the success of the project. That quality depends on the accuracy of the data itself, but also on its relevance to the business requirements.
Any good data scientist knows that missing and inaccurate data are the norm in any machine learning project. Even with information recorded by highly advanced monitoring tools, a fundamental principle of physics is that a measurement is never 100% accurate, and you have to keep that in mind. Each model is therefore, in one way or another, "wrong." But good models still allow data scientists to get close enough to reality to answer business problems and make more effective, objective decisions.
At some point, data scientists have to decide that they have enough data to build a viable model: using what we have is what brings us closer to what we want.
4. Iterate and adapt to change
A characteristic of data-driven projects, such as machine learning, is that they are never built once and for all. The business will, with more or less certainty, evolve in ways that require a model to be rebuilt.
A very recent example is the evolution of customer behavior with the pandemic. Statistical models dealing with certain problems before the crisis had to be reconstructed or adjusted to respond to the new reality.
As organizations continue to adapt to the crisis, they must continually rethink these models. When should this be done? When model performance degrades, which can only be detected by tracking it over time. That makes monitoring another essential element to put in place at the start of any machine learning project.
To monitor the efficiency of an algorithm, you need to set performance thresholds, which is fairly straightforward. When performance drops below a set threshold (the minimum required to provide useful information), it's time for a new iteration. That means understanding the new business needs, and the whole cycle starts again from the beginning.
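The threshold mechanism described above can be sketched in a few lines of Python. The 0.80 threshold, the accuracy metric, and the weekly history are illustrative assumptions, not values from the article; in practice the metric and threshold depend on the use case.

```python
# A minimal sketch of threshold-based model monitoring: track a
# performance metric over time and flag when it falls below the
# minimum level required to provide useful information.
# The threshold and history values are hypothetical.

ACCURACY_THRESHOLD = 0.80

def needs_retraining(accuracy_history, threshold=ACCURACY_THRESHOLD):
    """Return True when the most recent measurement is below threshold."""
    return bool(accuracy_history) and accuracy_history[-1] < threshold

# Weekly accuracy of a deployed model, drifting down as customer
# behavior changes.
history = [0.91, 0.89, 0.86, 0.82, 0.77]

if needs_retraining(history):
    print("Performance below threshold: start a new iteration.")
```

A production version would alert a team rather than print, and might trigger on a sustained drop rather than a single measurement, but the principle is the same: the threshold turns "the model has degraded" into an objective, observable event.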
Yujun Chen is a Senior Data Scientist at Finastra's Innovation Lab. He interprets data and designs models, using a wide range of statistical and machine learning tools and methodologies, to help clients solve business problems in treasury, capital markets, and corporate banking. He also holds a doctorate in physics.
Dawn also works as a data scientist at Finastra's Innovation Lab, where she and her team apply the latest advances in machine learning to solve problems in finance. She graduated from the Georgia Institute of Technology with a background in mathematics and statistics.