A decision tree is a tool of decision-making aid and to the Exploration of data. It makes it possible to model simply, graphically and quickly a more or less complex measured phenomenon. Its legibility, its speed of execution and the little of assumptions necessary a priori explain its current popularity.
In the field of the decision-making aid (Data-processing decisional and Datawarehouse) and dated mining, some algorithms produce decision trees , used to divide a population of individuals (customers for example) in homogeneous groups, according to a whole of discriminating variables (the age, the socioprofessional category,…) according to a laid down and known objective (turnovers, answer to a mailing,…). For this reason, this technique belongs to the methods of Apprentissage supervised. It is a question of predicting with the most possible precision the values taken by the variable to predict (objective, target, variable variable of interest, attribute classifies, variable of exit,…) starting from a whole of descriptors (variable predictive, variable discriminating, variables of entries,…).
This technique is popular as much in statistics than in machine learning. Its success resides mainly at its characteristics:
For better apprehending the induction of the decision trees, we will take again an example described in the work of Quinlan (1993). It is a question of predicting the behavior of sportsmen (To play; variable to be predicted) according to data weather (Sunning, Temperature, Moisture, Wind; predictive variables).
The algorithm of training seeks to produce groups of individuals most homogeneous possible from the point of view of the variable to be predicted starting from the variables of weather. Partitioning is described using a decision tree.
On each top of the tree is described the distribution of the variable to predict. In the case of the first top, the root of the tree, we note that there are 14 observations in our file, 9 of them decided to play (To play = yes), 5 decided the opposite (To play = not).
This first top is segmented using the variable Ensoleillement, 3 sub-groups were produced. The first groups on the left (Sunning = Sun) comprises 5 observations, 2 of them correspond To play = yes, 3 To play = not.
Each top is thus repeatedly treated until one obtains from the sufficiently homogeneous groups. They correspond to the sheets of the tree, of the tops which are not segmented any more.
The reading of a decision tree is very intuitive, it is what makes its success. The tree can be translated into base of rules without losses of information. If one considers the sheet on the left, we can easily read the rule of following assignment: “ If sunning = sun and moisture < 77.5% then to play = yes”.
To build a decision tree, we must answer the 4 questions following:
How to choose, among the whole of the variables available, the variable of segmentation of a top?
To choose the variable of segmentation on a top, the algorithm is based on a very rough technique: it tests all the potential variables and chooses that which maximizes a given criterion. It is necessary thus that the criterion used characterizes the purity (or the profit in purity) at the time of the passage of the top to be segmented towards the sheets produced by the segmentation. There exists a great number of criteria informational or statistical, the most used are the entropy of Shannon and the Coefficient of Gini and their alternatives.
Another manner of characterizing the segmentation is to measure the bond or causality between the variable candidate and the variable to be predicted. In this case, the criterion more used is the bond of the KHI-2 and its derivatives.
It should be noted that certain criteria make it possible to take into account ordinal nature of the target, and this, by using measurements or the heuristic suitable ones.
Final, these criteria, for little which they make it possible to make tighten partitioning towards the constitution of pure groups, exploit little the performances of the algorithms.
In our case, each value of the variable of segmentation makes it possible to produce a sheet (cf 3 values of sunning produces 3 tops), it is the case of the C4.5 algorithm for example. The algorithms of training can differ on this point. Some such as CART produce binary trees systematically, he thus seeks during the segmentation the binary regrouping which optimizes the criterion of segmentation. Others such as CHAID seek to carry out the most relevant regroupings while being based on statistical criteria. According to the technique, we will obtain more or less broad trees, the idea is to avoid splitting the data exaggeratedly in order not to produce too weak groups of manpower, corresponding to no statistical reality.
The treatment of the continuous variables must be in agreement with the use of the criterion of segmentation. In the large majority of the cases, the principle of segmentation of the continuous variables is very simple: to sort the data according to the variable to treat, test all the possible points of cut located between two successive points and to evaluate the value of the criterion in each case. The optimal point of cut corresponds quite simply to that which maximizes the criterion of segmentation.
In our example, the algorithm thus evaluated the points of cut located halfway of observations 1 to 5 correspondents with the values {70, 70,80,85,95} of Moisture.
Indeed, we often work on a sample which one hopes for representative of a population. All the stake is thus to find the good measure between collecting the useful information, really corresponding to a relation in the population, and introducing specificities of the file on which we are working (the sample known as of training), corresponding in fact to a statistical artefact. Said Autement, it never should be forgotten that the performance of the tree is evaluated on the same data which were used for its construction: the more complex the model is (the larger the tree is, the more it has branches, the more it has sheets), the more one runs the risk to see this model unable to be extrapolated with new data, i.e. to give an account of the reality which we precisely try to apprehend. In extreme cases, and it is particularly true decision trees, a model is able to retort any sample of data exactly, without to apprehend any reality … Indeed, so in an extreme case one decides to make further push our tree possible, we can obtain a tree made up of as many sheets as individuals in the sample training. Our tree then does not make any error on this sample since he marries all the characteristics of them: performance equalizes at 100%. Since one applies this model on new data which by nature do not have toutse the characteristics of our sample of training (it acts simply of another sample) its performance thus will be degraded for in extreme cases approaching 0%…!
Thus, when one builds a decision tree, one risks what one calls surajustement a model: the model seems powerful (its average error is very weak) but it is not it actually at all! It we will be supposed to find the tree smallest possible having the greatest possible performance. Plus one tree is small and more it will be stable in its future forecasts (in statistics, the principle of parsimony almost always prevails); the more powerful one tree is, the more it is satisfactory for the analyst. It is not used for nothing to generate a model of good quality very, if this quality is not constant and is degraded when one applies this model to a new whole of data.
In other words, to avoid an on-adjustment of our trees (and it is also true of all the mathematical models which one could build), it is advisable to apply a principle of parsimony and to carry out arbitrations performance/ complexity of the models used. With comparable performance, one will always prefer the simplest model, if one wishes to be able to use this model on new completely unknown data.
In the graph opposite, we observe the evolution of the error of adjustment of a decision tree according to the number of sheets of the tree (complexity). We note that if the error decrease constantly on the sample of training, starting from a certain level of complexity (of very many branches and very many sheets) this model moves away from reality, reality which one tries to measure or in any case to approach on a second sample of data (sample test in the graph).
In the case of the decision trees, several types of algorithmic solutions were planned to try to avoid as much as possible a problem of potential on-adjustment of the models: it is of the techniques known as the pre one or post élégage of the trees.
Certain statistical theories (see work of the Russian mathematician Vladimir Vapnik) go even until having the aim of finding the optimum between the error made on the sample of training and that made on the sample of test. The theory of Vapnik Chervonenkis, “Structured Risk Minimization (SRM)”, by using a variable called “VC dimension”, provides a perfect mathematical modeling of the determination of the optimum of a model, usable consequently to generate models which ensure the best compromise between quality and robustness of the model.
In all the cases, these algorithmic solutions are complementary to performance analyzes compared and stability carried out on the samples of training and test.
Once the built tree, it is necessary to specify the rule of assignment in the sheets. A priori, if they are pure, the answer is obvious. If it is not the case, a simple rule is to decide like conclusion of the sheet the majority class, that which is represented.
This very simple technique is the optimal procedure within a very precise framework: the data result from a simple random pulling in the population; the matrix of the costs of bad assignment is unit (symmetrical) càd to affect advisedly does not cost anything (cost equal to 0), and to affect wrongly costs 1 whatever the case of figure. Apart from this groundwork, the rule of the majority is not justified. It remains nevertheless that it is a reference frame easy to use in practice.
From these 4 points, there exists a very great number of alternatives. Some lay the stress on such or such aspect of the algorithm of construction, but there does not exist method which is in practice systematically more powerful. Here a nonexhaustive list of the algorithms classically used:
These algorithms are characterized by the criteria of segmentation used, by the implemented methods of élégages… but also by their manner of managing the given missing in our predictors. This point takes all its importance in practice. Indeed, the data at disposal in company for example are quasi systematically incomplete. What to make in the presence of missing data?
The trees knew a Net renewed interest when the methods of aggregation of the classifieurs such as the Boosting or the bagging were developed and popularized in the community of the machine Learning. Some of the characteristics of the trees, in particular them very marked variability , enable them to benefit from the advantages of the combination of the predictors. Specific techniques were even developed to produce trees individually not very effective, but when they are combined, prove very powerful, it is the case in particular of Random Forests (forests of trees) of Breiman.
In processes of Data mining the decision trees are sometimes combined with more traditional techniques of modeling of a laid down objective: analyzes discriminating, logisitic regressions (logistic regression), linear regressions (linear regression), networks of neurons (will perceptrons multi-layer, radial basis function network…)… They also are very often combined between them to benefit from their respective advantages. Procedures of aggregation of the performances of the various models used, sometimes called procedures of vote, are then installation to obtain a maximum performance, while controlling and by minimizing the level of complexity of the whole of the models used.
The decision trees such as they were presented here are dedicated to the supervised training where one tries to predict (to explain) the values taken by a discrete variable (To be sick or not, Répondre positively an promotional offer or not, etc) starting from a series of discriminating variables of unspecified type. Actually, the step can be wide with other types of problems.
When the target variable is continuous, we are located within the framework of the regression (the most popular method within this framework is the linear Regression). The same diagram of exploration can be applied, except that instead of optimizing the error rate, we optimize the quadratic error, and the criterion of segmentation becomes the decomposition of the variance: we choose the variable which maximizes the variance between classes. One speaks in this case about trees about regression, method CART, in his version RT (Tree Regression) is most known.
When we deal with several target variables, we are located in a problem of automatic Classification (or of typology). One speaks in this case about trees about classification (Clustering Tree in English). The criterion of evaluation of the partitions is the inertia, which is neither more nor less than one generalization of the variance within a multidimensional framework. The most known methods are due to Chavent (1998, the Methods Divisives Monothétiques) and Blockeel (1998, Predictive Clustering Tree). It seems that the typologies suggested by these techniques are also good, in terms of explained inertia, that those produced by the traditional approaches (Ascending hierarchical clustering, dynamic Nuées, etc), with the legibility of the results moreover.
L. Breiman, J. Friedman, R. Olshen, C. Stone: CART: Classification and Regression Trees , Wadsworth International, 1984.
Decision trees, Tutoriel on line, Review MODULAD n°33.
SPSS, Clementine, SAS, SAS Enterprise To mine, ALICE, SPAD, STATISTICA, SYSTAT
Certain business packages of campaigns marketing and/or optimization of campaigns marketing (it is also the case of other functional problems) integrate in particular algorithms of construction of decision tree (thus very often that heuristic functional calculuses), in a completely transparent way, and make from there the use directly operational:
Today, majority of the large commercial editors of relational databases (relational Database, Oracle (database), Microsoft SQL Server…) many algorithms of Data mining embark of which certain decision trees. The idea is to make it possible to the analysts to exploit all the computing power of these databases and the machines waiter on which they are installed.
| Random links: | Méricourt-in-Vimeu | Roger Bambuck | Foumban | Olivier Rouault | Saint-Exupéry, prince of the pilots |