Data science is an interdisciplinary field that extracts valuable information and knowledge from structured or unstructured data. From a business perspective, data science allows you to translate a business problem into a research and analysis project and then transform it into a practical solution with the help of data. Today, Data Science is a multidisciplinary science that requires skills closer to the business world, linked to the ability to read, interpret, understand, and capitalise on data to extract helpful value from it.
It is precisely from this definition that the phases of a typical Data Science process derive. They describe a method of data analysis and interpretation that must be seen as iterative rather than linear, and as subject to continuous verification. The cyclical nature of the process does not prevent the identification of its fundamental steps. Let’s find out what they are below.
If, as we have said, the purpose of Data Science is to translate a problem into an analysis project and then into a practical solution, the first important step to take is to identify and understand the problem. In fact, before solving a problem, it is essential to define precisely what it is, which means being able to “translate” data questions into something usable. Simplifying and generalising a bit, data science helps answer five basic types of questions:
Looking at these questions from a business perspective, identifying and understanding the problem means asking the right questions of your interlocutors (business people), who often provide ambiguous and subjective input. You need the ability, and the intuition, to turn that input into information and to ask the right questions, so that the outputs of this phase can be used in the other phases of the Data Science process. For example, business questions such as:
Questions of this kind can help to uncover the problem (some customers buy less than expected or behave differently), from which to start a series of analyses that support specific decisions (continue to invest in a product or change the offer). It is essential that, by the end of this phase, all the elements are in place to define the specific business and corporate context and that the problem is clearly outlined, so that the analysis project, even before thinking about the data, can be planned to give a concrete answer to a clear business need (which may also be organisational in nature). This phase is critical because it allows you to identify a clear goal for what you want to do with the data.
Once you define your problem and clarify your goal, you need to collect data and find the datasets you need to solve the problem. Data selection is the process of collecting data from different sources. This phase of the process requires some attention because it involves thinking a priori about which data will be needed and actually “retrieving” the data from a plurality of sources (internal to the company and external datasets).
In fact, the data can be structured (coming, for example, from databases and internal company applications such as a CRM or an industrial application for production or warehouse management) or unstructured (texts, images, and videos coming from emails, documents, and collaboration platforms, but also from external sources such as social networks, open document repositories, web pages, etc.). It can also happen that a company does not have enough data at the start of the project to experiment with. Data Augmentation can help overcome this initial obstacle and ensure the project’s feasibility.
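As a minimal sketch of what bringing together internal and external sources can look like, the example below joins a few records from an internal application with an external CSV file on a shared key. The field names (`customer_id`, `segment`, `region`) and all the values are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical internal source: rows as they might come out of a CRM export.
crm_rows = [
    {"customer_id": "C001", "segment": "retail"},
    {"customer_id": "C002", "segment": "wholesale"},
    {"customer_id": "C003", "segment": "retail"},
]

# Hypothetical external source: a CSV file, inlined so the sketch runs as-is.
external_csv = """customer_id,region
C001,North
C003,South
C004,East
"""


def load_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def left_join(left, right, key):
    """Keep every left row; add the matching right-hand fields, or None."""
    right_by_key = {row[key]: row for row in right}
    right_fields = [f for f in (right[0].keys() if right else []) if f != key]
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for field in right_fields:
            merged[field] = match.get(field)
        joined.append(merged)
    return joined


dataset = left_join(crm_rows, load_csv(external_csv), key="customer_id")
for row in dataset:
    print(row)
```

Customers that appear only in the internal source keep a `None` region, which makes the gaps visible before the data moves on to the preparation phase.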
The Data Cleaning (or data preparation) phase consists of manipulating, or pre-processing, the raw data coming from various sources and in different formats in order to clean, arrange, harmonise, and transform them into data that analysis tools can use. Data preprocessing is typically the most time-consuming phase. It involves data preparation procedures such as profiling, cleansing, validation, and data transformation (often with ETL, i.e. Extract, Transform, Load, and Data Quality technologies). Working on “data cleansing” means:
An integral part of Data Preparation is also the so-called data enrichment, i.e., the set of processes necessary to integrate and improve the raw data and information in the databases. This process is essential because it allows you to compare data from different sources and to unify and integrate them, so as to arrive at intact, accurate, and complete datasets. Although this is a demanding phase, it must be carried out with the utmost care, because otherwise inconsistent, missing, or poor-quality data can be carried over into the next phase.
Such data leads to errors in the analysis models and to outputs that may be ineffective for decision-making or, even worse, misleading. Additionally, regulatory compliance is a crucial element of data preparation. It is essential to ensure from the early stages of the project that the data used (and the methods with which they are used and analysed) comply with privacy and data protection laws.
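To make the cleansing work described in this phase more concrete, here is a small sketch that harmonises text values, converts numbers, removes duplicates, and fills missing values. The records, the choice to treat rows with the same `id` as duplicates, and the use of the median to fill gaps are all assumptions made for the example. Real projects choose these rules case by case.

```python
from statistics import median

# Hypothetical raw records with typical quality problems.
raw_records = [
    {"id": "1", "city": " Milan ", "revenue": "1200"},
    {"id": "2", "city": "rome", "revenue": ""},       # missing value
    {"id": "2", "city": "rome", "revenue": ""},       # duplicate id
    {"id": "3", "city": "TURIN", "revenue": "n/a"},   # unparseable value
    {"id": "4", "city": "Rome", "revenue": "800"},
]


def parse_number(text):
    """Return the text as a float, or None when it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None


def clean(records):
    """Deduplicate by id, harmonise formats, and fill missing revenue."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen_ids:
            continue  # assumption: a repeated id is a duplicate row
        seen_ids.add(rec["id"])
        cleaned.append({
            "id": int(rec["id"]),
            "city": rec["city"].strip().title(),
            "revenue": parse_number(rec["revenue"]),
        })
    known = [r["revenue"] for r in cleaned if r["revenue"] is not None]
    fill = median(known) if known else None
    for r in cleaned:
        # Keep track of which values were filled in rather than observed.
        r["revenue_imputed"] = r["revenue"] is None
        if r["revenue"] is None:
            r["revenue"] = fill
    return cleaned


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Flagging filled-in values in a separate field keeps the intervention visible, so later phases can decide whether to trust those rows.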
Once the data has been prepared, the process continues with the Data Exploration phase, in which an initial “exploratory analysis” is conducted. In essence, statistical tests are run, the first analyses are carried out, and the first data visualisation techniques are tried out. In this phase, the Data Scientists identify and prepare what is necessary to experiment with the analytical models, understand their performance with respect to the problem to be solved and the available data, and, above all, identify any “bias in the data”.
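A first exploratory pass often amounts to a handful of summary statistics plus a check on how the target values are distributed. The sketch below assumes a tiny hypothetical dataset of customer ages and purchase outcomes, and treats a dominant class share of 75% or more as a possible sign of imbalance. That threshold is an arbitrary illustration, and class imbalance is only one simple form of bias.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: customer age and whether a purchase was made.
ages = [23, 31, 35, 38, 41, 44, 47, 52, 58, 61]
purchased = ["no", "no", "yes", "no", "no", "no", "no", "yes", "no", "no"]


def summarise(values):
    """Basic descriptive statistics for a numeric column."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def class_shares(labels):
    """Share of each label, as a fraction of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_imbalanced(shares, threshold=0.75):
    """Flag the column when one class reaches the (assumed) threshold."""
    return max(shares.values()) >= threshold


age_summary = summarise(ages)
shares = class_shares(purchased)
print(age_summary)
print(shares, "imbalanced:", is_imbalanced(shares))
```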
The Data Exploration phase is also where the iterative, non-linear nature of the process emerges: errors in the data, or more generally a need for intervention, may appear and “bring back” the teams to the previous cleaning, preparation, and data enrichment phase. Experimentation and modelling are part of the Data Exploration phase, or at least closely connected to it. They consist of identifying and building the analysis model for solving the specific problem identified in the first phase of the entire Data Science process. These activities imply the “development” of all the control and validation parameters of the analytical model, including the choice of algorithms and their possible “tuning”.
The model is then tested on the transformed data, and based on the outputs generated (i.e., the insights obtained), its performance and effectiveness are evaluated in terms of the accuracy of the information and its actual value for the decision-making process. In this phase, the Data Visualization systems are also tested to verify that the information generated by the analysis models is accessible, usable, and understandable to the business people involved in the decision-making process.
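One simple way to evaluate a model's outputs is to compare its predictions with known outcomes. The sketch below computes accuracy together with precision and recall, two measures commonly reported alongside it. The labels are made up for illustration, and the metrics that matter in practice depend on the business problem.

```python
def evaluate(actual, predicted, positive="yes"):
    """Compare predictions with known outcomes for a two-class problem."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        # Share of all predictions that were right.
        "accuracy": (tp + tn) / len(pairs),
        # Of the predicted positives, how many were truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the true positives, how many the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical held-out outcomes and the model's predictions for them.
actual = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

metrics = evaluate(actual, predicted)
print(metrics)
```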
It is in this phase that the actual data mining takes shape, although the whole process of research, extraction, and interpretation of patterns from data (referred to as KDD, Knowledge Discovery in Databases) is typically interactive and iterative, and implies the repeated application of specific data mining methods and algorithms and the interpretation of the patterns generated by those algorithms. At this point in the process, algorithms are used to analyse the data, uncover hidden patterns, or extract interesting knowledge.
The “typical” operations of this phase are the identification of parameters, processing, modelling, and model evaluation. Indeed, it is here that we define how to extract practical value from large volumes of information, choosing the algorithms and “training” methods used to search for patterns in the data (for example, with machine learning), as well as the form of representation of the models with which we want to extract information (classification rules, decision trees, regression, clustering, etc.).
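Clustering is one of the model representations mentioned here. As a minimal sketch, assuming one-dimensional data and hand-picked starting centroids (both simplifications, like the spending figures themselves), a basic k-means loop can be written as follows.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Basic k-means on 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(v - centroids[i])
            )
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, nothing left to update
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centroids, clusters = kmeans_1d(spend, centroids=[0, 100])
print(centroids)
print(clusters)
```

Here the two groups are obvious by eye. The point of the sketch is only to show the alternation between assigning points and updating centroids, which is the pattern that the algorithm searches with.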
The interpretation of the patterns found may itself lead back to the previous phases of the Data Science process for further iterations. After testing the first models, Data Scientists may identify others with which to do more in-depth analysis (for example, to discover trends in the data that were not distinguishable in the initial graphs and statistics) or to “build forecasts” (for example, by analysing past data and finding the features that influenced past trends, in order to build models for so-called predictive analytics).
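A very simple form of forecasting from past data is to fit a straight trend line and extend it one period ahead. The sketch below uses ordinary least squares on hypothetical monthly sales figures. It is an illustration of the idea only, since real predictive models usually draw on many more features than time alone.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Value of the fitted line at x."""
    return slope * x + intercept


# Hypothetical past data: month number and units sold.
months = [1, 2, 3, 4, 5, 6]
sales = [102, 108, 121, 129, 142, 148]

slope, intercept = fit_line(months, sales)
forecast = predict(7, slope, intercept)
print(f"trend: {slope:.2f} units per month, forecast for month 7: {forecast:.1f}")
```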
An essential part of this phase of the Data Science process is also to provide business people with all the necessary elements (both quantitative and qualitative) to access information and knowledge that are genuinely relevant to the identified problem and to the possible solution, and that are therefore useful for the business decision. This is why, after Data Exploration, a lot of time is often dedicated to modelling. In other words, it is in this phase that, after having made all the assessments and any necessary iterations, the Data Scientists make the model operational, making it available to business people (primarily through Data Visualization systems).
As mentioned, data visualisation comes into play several times during the various phases of the typical Data Science process. Therefore, although it finds ample space in phase six, it is also good to place it in the previous steps, especially in the Data Exploration phase. The “final” stage of the process concerns the communication of the results deriving from the analyses, understood not so much as the return of information by the Data Scientists to the business people but rather as the visualisation of these results through the analysis systems, which must be made readily available and usable by business users.
This is where Data Visualization and Data Storytelling systems come into play, i.e., advanced data analysis systems that allow you to “read” information, correlations, and patterns among hundreds or thousands of data points (of different formats and structures, coming from diversified sources, such as Big Data). In other words, they help to find “a story” hidden in the data, one that can only “surface” through advanced analysis and that becomes usable by business people without specific technical skills, thanks to data storytelling and information visualisation.
Once the analysis system is in production, one must avoid thinking the process is finished. It is essential to continue monitoring the performance of the models against the business objectives, because models and algorithms do not remain effective indefinitely. One of the biggest mistakes in Data Science is thinking that once a model has been developed and made operational, it will always continue to work effectively.
In reality, the quality of the models tends to deteriorate, and the Data Scientists are called upon to continuously improve them by working on the data (feeding the systems with new data, for example), i.e., starting again from Data Selection and gradually moving through the other phases of the Data Science process.
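Monitoring of this kind can start from something very simple: track whether recent predictions turned out to be correct and compare that rate with the accuracy measured when the model went live. The sketch below assumes a window of five recent predictions and a tolerated drop of ten percentage points. Both numbers and the data are arbitrary illustrations.

```python
def rolling_accuracy(outcomes, window):
    """Share of correct predictions among the most recent `window` outcomes."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent)


def needs_retraining(outcomes, baseline_accuracy, window=5, tolerance=0.10):
    """Flag the model when recent accuracy falls too far below the baseline."""
    if len(outcomes) < window:
        return False  # not enough evidence yet
    return rolling_accuracy(outcomes, window) < baseline_accuracy - tolerance


# Hypothetical log: True means the prediction matched what later happened.
healthy = [True] * 10
degraded = [True] * 8 + [True, False, False, True, False]

print(needs_retraining(healthy, baseline_accuracy=0.9))
print(needs_retraining(degraded, baseline_accuracy=0.9))
```

When the check fires, the natural next step is the one described above: go back to Data Selection with fresh data and move through the phases again.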