Data mining refers to extracting various types of information, not known a priori, through targeted analysis of large databases. The source can be a single database or several; in the second case, more accurate information is obtained by cross-referencing the data of the individual databases.
The techniques and strategies applied to data mining are primarily automated and consist of software and algorithms, each suited to a specific purpose. Those in common use today include neural networks, decision trees, clustering and association analysis. Data mining is applied in the most varied fields: economic, scientific, operational, etc.
Discovering Data Mining
To fully understand what data mining is, beyond the technical definitions, however accurate, it may be helpful to start from its purposes, providing some examples. Let’s take the following questions:
- How can smartphones be sold to consumers in their seventies?
- Could a black hole be hiding at the centre of a recently discovered remote galaxy?
The answers to these questions, or to part of them, may be contained in databases. The problem is that they are there in an unintelligible form. No one today could manually process big data, that is, the vast and heterogeneous masses of data held in data warehouses, in a reasonable time.
This is where data mining comes into play: it finds associations, anomalies and recurring patterns, and therefore ultimately information, within them. Above all, thanks to the high parallelism of the computing resources used (alongside highly specialized operators), it does so with an efficiency that far exceeds that of a human analyzing the data manually.
In short, data mining starts from “cryptic” information, scattered without apparent order across a database (textual, multimedia, mixed data, etc.), and arrives at knowledge that can be used for various purposes. The whole process is called KDD (an acronym for Knowledge Discovery in Databases), and it does not end with the data mining procedure itself.
The KDD sequence has several steps, the main ones being:
- identification of the goal to be achieved;
- preselection of the data needed to reach it;
- data cleaning and preprocessing: further separation of valid data from useless data, choice of how to treat incomplete or empty fields, and final selection of the information essential to the reference model;
- transformation: is the format in which the data are represented suitable for the analysis software? If not, the data must be converted;
- data mining: this is, of course, the most crucial step. The software best suited to the individual case is chosen and selectively scans the data warehouse to provide the desired answer. Data mining usually consists of several sub-steps, which may be repeated several times to refine the procedure and gradually verify the results achieved;
- interpretation of the results: the results are evaluated to determine whether the objective has been met; if not, the previous step (possibly modified), and sometimes earlier ones too, are repeated;
- display of the results in an understandable format.
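The steps above can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the records, the field names and the "65+" age band are hypothetical, and dropping incomplete records is just one of several possible cleaning policies.

```python
from collections import Counter

# Hypothetical raw records; None marks an empty field.
RAW = [
    {"customer": "a", "age": 71, "product": "smartphone"},
    {"customer": "b", "age": None, "product": "smartphone"},
    {"customer": "c", "age": 34, "product": "laptop"},
    {"customer": "d", "age": 68, "product": "tablet"},
    {"customer": "e", "age": 75, "product": "smartphone"},
]

def select(records, fields):
    """Preselection: keep only the fields relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Cleaning: in this sketch, incomplete records are simply dropped."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: convert the age into the band the analysis expects."""
    return [
        {"band": "65+" if r["age"] >= 65 else "under 65", "product": r["product"]}
        for r in records
    ]

def mine(records):
    """Mining: count how often each (band, product) pair occurs."""
    return Counter((r["band"], r["product"]) for r in records)

def interpret(counts, band):
    """Interpretation: most frequent product in the band, or None if empty."""
    in_band = {product: n for (b, product), n in counts.items() if b == band}
    return max(in_band, key=in_band.get) if in_band else None

def run_kdd(raw):
    """Run selection, cleaning, transformation, mining and interpretation."""
    data = transform(clean(select(raw, ["age", "product"])))
    return interpret(mine(data), "65+")

top_product = run_kdd(RAW)
# Display of the result in an understandable format.
print(f"Most purchased product in the 65+ band: {top_product}")
```

In a real project each function would be far richer, and an unsatisfactory result at the interpretation stage would send the analyst back to an earlier step.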
Tasks For Data Mining
The main tasks for data mining are:
- classification: assignment of elements to classes defined by specific rules, so that each class groups the elements that satisfy its rules;
- clustering (or segmentation): identification of groups of homogeneous elements which, unlike in classification, are based on rules that remain hidden until they are discovered;
- association: discovery of seemingly random but recurring links in the data contained in a database, used, for example, to detect anomalies;
- regression: similar to classification, except that the predicted variable, which in classification takes one of a fixed set of categories, can instead assume a large or infinite number of values;
- time series (or historical series): complex regressions that incorporate time variables (dates, changes in interest rates, etc.) and are therefore particularly useful for predictive purposes;
- sequence discovery: takes up the concept of association but adds sequential correlation, i.e. it detects when A (for example, the purchase of a toy) is followed by B (the purchase, within a certain subsequent period, of an accessory for that toy).
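As a rough illustration of sequence discovery, the sketch below counts how many customers who bought one item went on to buy another within a given number of days. The purchase log, the item names and the 30-day window are all hypothetical, and real tools use considerably more efficient algorithms.

```python
from collections import defaultdict

# Hypothetical purchase log: (customer, day number, item).
PURCHASES = [
    ("anna", 1, "toy"), ("anna", 5, "toy accessory"),
    ("bruno", 2, "toy"), ("bruno", 40, "toy accessory"),
    ("carla", 3, "toy"),
    ("dario", 4, "book"), ("dario", 6, "toy accessory"),
]

def followed_by(purchases, first, then, window):
    """Return the customers who bought `first`, and the subset of them
    who bought `then` within `window` days afterwards."""
    by_customer = defaultdict(list)
    for customer, day, item in purchases:
        by_customer[customer].append((day, item))

    bought_first, followed = set(), set()
    for customer, events in by_customer.items():
        first_days = [d for d, item in events if item == first]
        then_days = [d for d, item in events if item == then]
        if first_days:
            bought_first.add(customer)
        # The second purchase must come strictly after the first one.
        if any(0 < t - f <= window for f in first_days for t in then_days):
            followed.add(customer)
    return bought_first, followed

buyers, followers = followed_by(PURCHASES, "toy", "toy accessory", window=30)
confidence = len(followers) / len(buyers)
print(f"{len(followers)} of {len(buyers)} toy buyers bought an accessory within 30 days")
```

Here only one of the three toy buyers qualifies: the second bought the accessory too late, and the fourth customer bought the accessory without ever buying the toy.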
Data Mining Tools
The tools for data mining change depending on the goal, and the various methods are often combined. A neural network is a program that mimics, in some respects, the functioning of a biological neural network. It is equipped with instructions and a learning algorithm that allows it to evolve with experience, expanding its ability to solve certain types of problems.
A supervised learning neural network is trained by providing both inputs (problems) and the corresponding outputs (solutions). By detecting the associations between them, it learns to produce correct results autonomously. An unsupervised learning neural network is instead trained only with inputs consisting of selected types of data. By examining them, the network learns to grasp similarities and differences and to group the data accordingly. Thanks to their high parallel computing capacity, both categories of neural network can process big data efficiently, carrying out classification, association and clustering.
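To make supervised learning concrete, here is a minimal sketch of a single artificial neuron (a perceptron) that learns the logical AND function from input/output pairs. It is a deliberately tiny stand-in for a real neural network; the function names, the learning rate and the number of epochs are illustrative assumptions.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, rate=1):
    """Adjust weights and bias whenever the output differs from the target."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Training set: inputs (problems) paired with the expected outputs (solutions).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(predictions)
```

After training, the neuron reproduces the expected outputs for all four inputs without being told the rule explicitly. An unsupervised network would receive only the inputs and would have to find structure in them on its own.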
A decision tree is a graph in which classification starts from the root (the training set) and follows a path made of successive choices: each node represents a subset of the data, its branches are the alternatives, and the path ends at a leaf (a result or class). A correctly built decision tree must be of adequate, that is, not excessive, size: too many variables would turn a fast and efficient algorithm into a chaotic and slow one. In data mining, decision trees are used for segmentation, classification, regression and time-series operations.
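The root-to-leaf path can be illustrated with a small hand-built tree. The attributes and offers below are hypothetical, and in practice the tree would be learned from the training set rather than written by hand.

```python
# Hypothetical hand-built tree: internal nodes test one attribute,
# leaves (plain strings) hold the resulting class.
TREE = {
    "attribute": "age_over_65",
    "branches": {
        True: {
            "attribute": "owns_smartphone",
            "branches": {True: "upgrade offer", False: "starter offer"},
        },
        False: "standard offer",
    },
}

def classify(node, record):
    """Follow one branch per node until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def depth(node):
    """Number of decisions on the longest root-to-leaf path."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(depth(child) for child in node["branches"].values())

offer = classify(TREE, {"age_over_65": True, "owns_smartphone": False})
print(offer, depth(TREE))
```

The `depth` helper gives a simple measure of the tree's size: each extra variable adds a level of decisions, which is why an oversized tree becomes slow and hard to read.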
Main Fields Of Application
The fields of application of data mining are innumerable but can be grouped into some macro-categories. The main ones are:
- marketing;
- economics and finance;
- science;
- information and communication technologies (ICT);
- statistics;
- industry.
In the vast field of marketing, the main applications of data mining concern:
- customer clustering (database marketing): identification of types of buyers sharing purchasing habits and socio-demographic characteristics;
- customer retention: by analyzing the behaviour of a brand’s customers, it becomes possible to identify those at risk of abandonment and therefore to adopt appropriate strategies to prevent it;
- market basket analysis: which products or services are usually bought together? The study of associations makes it possible to find out.
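Market basket analysis usually rests on two simple measures: support (how often items appear together) and confidence (how often one item appears given another). The sketch below computes both on a handful of hypothetical baskets; production systems use dedicated algorithms to cope with millions of transactions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def pair_support(baskets):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair always appears in the same order.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Share of baskets with `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(BASKETS)
print(support[("bread", "butter")])             # 3 of 5 baskets
print(confidence(BASKETS, "bread", "butter"))   # 3 of the 4 baskets with bread
```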
In the financial sphere, data mining applies, among other things, to:
- fraud detection: by analyzing, for example, the use of credit cards, it is possible to identify anomalies and ultimately trace fraudulent behaviour;
- forecasts of stock index trends;
- analysis of the interactions between financial markets: useful for predicting the general trend of demand in a single market.
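For fraud detection, the idea of spotting anomalies in card usage can be sketched with a simple robust statistic. The example below flags amounts that lie far from the median, measured in units of the median absolute deviation (MAD). The amounts and the threshold are hypothetical, and this technique is chosen only for illustration: the text does not name a specific method, and real systems combine many more signals.

```python
from statistics import median

def flag_anomalies(amounts, threshold=5.0):
    """Return the amounts whose distance from the median exceeds
    `threshold` times the median absolute deviation (MAD)."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        # At least half the values equal the median: no usable scale.
        return []
    return [a for a in amounts if abs(a - center) / mad > threshold]

# Hypothetical card transactions for one customer.
AMOUNTS = [20, 35, 25, 30, 22, 28, 950]

suspicious = flag_anomalies(AMOUNTS)
print(suspicious)
```

The median is used instead of the mean because a single very large transaction would otherwise inflate both the average and the spread, making the outlier harder to see.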
In the scientific field, too, data mining is used in countless sectors and is particularly relevant in:
- medicine and biology: clinical practice, genomics, pharmacology, etc. In clinical practice and pharmacology in particular, data mining is a good support for decision making: it provides a predictive model, based on the knowledge it extracts, that informs the decision-making process. Examples include the choice of treatment protocols and the selection of suitable surgical prostheses;
- meteorology: the accuracy of weather forecasts depends on the cross-analysis of enormous amounts of data, in short a task perfectly suited to data mining. One of the most relevant examples is the immense volume of data sent by satellites;
- astronomy: classification and identification of stars, galaxies, planets, satellites and other celestial bodies.
In the field of statistics, data mining speeds up demographic analyses and, above all, extracts from them information that standard statistical methods cannot reach, providing valid predictive models.
In industry, productivity gains are made possible by analyses capable of identifying errors or inefficiencies along the production chain, from support to logistics, etc.
Data Mining: A Privacy Risk?
The downside of data mining is its potential to violate privacy. Take, for example, the careful segmentation of target consumers for marketing purposes. It is one of the achievements of data mining, but its side effect is that profiling exposes an individual’s characteristics without the individual being aware of it and, therefore, without their consent.
The two sides of the coin cannot be separated. Put simply, the more you know about an individual, the better you can push them towards a particular purchase. This knowledge-gathering process therefore becomes a 360° observation, ranging from purchasing habits to financial situation, from individual psychology to sexual practices, from ethnicity to religious belief, and so on. Everything is useful for marketing purposes.