A decision tree is a tool of decision-making aid and to the Exploration of data. It makes it possible to model simply, graphically and quickly a more or less complex measured phenomenon. Its legibility, its speed of execution and the little of assumptions necessary a priori explain its current popularity.

Introduction

In the field of the decision-making aid (Data-processing decisional and Datawarehouse) and dated mining, some algorithms produce decision trees , used to divide a population of individuals (customers for example) in homogeneous groups, according to a whole of discriminating variables (the age, the socioprofessional category,…) according to a laid down and known objective (turnovers, answer to a mailing,…). For this reason, this technique belongs to the methods of Apprentissage supervised. It is a question of predicting with the most possible precision the values taken by the variable to predict (objective, target, variable variable of interest, attribute classifies, variable of exit,…) starting from a whole of descriptors (variable predictive, variable discriminating, variables of entries,…).

This technique is popular as much in statistics than in machine learning. Its success resides mainly at its characteristics:

  • legibility of the model of prediction, the decision tree, provided. This characteristic is very important, because the work of the analyst also consists in rendering comprehensible its results in order to carry the adhesion of the decision makers.
  • capacity automatically to select the discriminating variables in a data file containing a very great number of potentially interesting variables. In this direction, a decision tree constitutes a privileged exploratory technique to apprehend large data files.
But

Didactic example

For better apprehending the induction of the decision trees, we will take again an example described in the work of Quinlan (1993). It is a question of predicting the behavior of sportsmen (To play; variable to be predicted) according to data weather (Sunning, Temperature, Moisture, Wind; predictive variables).

The algorithm of training seeks to produce groups of individuals most homogeneous possible from the point of view of the variable to be predicted starting from the variables of weather. Partitioning is described using a decision tree.

On each top of the tree is described the distribution of the variable to predict. In the case of the first top, the root of the tree, we note that there are 14 observations in our file, 9 of them decided to play (To play = yes), 5 decided the opposite (To play = not).

This first top is segmented using the variable Ensoleillement, 3 sub-groups were produced. The first groups on the left (Sunning = Sun) comprises 5 observations, 2 of them correspond To play = yes, 3 To play = not.

Each top is thus repeatedly treated until one obtains from the sufficiently homogeneous groups. They correspond to the sheets of the tree, of the tops which are not segmented any more.

The reading of a decision tree is very intuitive, it is what makes its success. The tree can be translated into base of rules without losses of information. If one considers the sheet on the left, we can easily read the rule of following assignment: “ If sunning = sun and moisture < 77.5% then to play = yes”.

Construction of a decision tree

To build a decision tree, we must answer the 4 questions following:

  • How to choose, among the whole of the variables available, the variable of segmentation of a top?

  • When the variable is continuous, it is the case of the Humidité variable, how to determine the threshold of cut during the segmentation (value 77.5 in the decision tree above)?
  • How to determine the good size of the tree? Is it desirable to produce absolutely pure sheets according to the variable to predict, even if the group corresponding corresponds to a very weak fraction of the observations?
  • Lastly, how to affect the value of the variable to be predicted in the sheets? When the group is pure the answer is obvious, in the contrary case, we will see that we should adopt a strategy.

Criterion of segmentation

To choose the variable of segmentation on a top, the algorithm is based on a very rough technique: it tests all the potential variables and chooses that which maximizes a given criterion. It is necessary thus that the criterion used characterizes the purity (or the profit in purity) at the time of the passage of the top to be segmented towards the sheets produced by the segmentation. There exists a great number of criteria informational or statistical, the most used are the entropy of Shannon and the Coefficient of Gini and their alternatives.

Another manner of characterizing the segmentation is to measure the bond or causality between the variable candidate and the variable to be predicted. In this case, the criterion more used is the bond of the KHI-2 and its derivatives.

It should be noted that certain criteria make it possible to take into account ordinal nature of the target, and this, by using measurements or the heuristic suitable ones.

Final, these criteria, for little which they make it possible to make tighten partitioning towards the constitution of pure groups, exploit little the performances of the algorithms.

In our case, each value of the variable of segmentation makes it possible to produce a sheet (cf 3 values of sunning produces 3 tops), it is the case of the C4.5 algorithm for example. The algorithms of training can differ on this point. Some such as CART produce binary trees systematically, he thus seeks during the segmentation the binary regrouping which optimizes the criterion of segmentation. Others such as CHAID seek to carry out the most relevant regroupings while being based on statistical criteria. According to the technique, we will obtain more or less broad trees, the idea is to avoid splitting the data exaggeratedly in order not to produce too weak groups of manpower, corresponding to no statistical reality.

Treatment of the continuous variables

The treatment of the continuous variables must be in agreement with the use of the criterion of segmentation. In the large majority of the cases, the principle of segmentation of the continuous variables is very simple: to sort the data according to the variable to treat, test all the possible points of cut located between two successive points and to evaluate the value of the criterion in each case. The optimal point of cut corresponds quite simply to that which maximizes the criterion of segmentation.

In our example, the algorithm thus evaluated the points of cut located halfway of observations 1 to 5 correspondents with the values {70, 70,80,85,95} of Moisture.

To define the good size of the tree

The objective being to produce homogeneous groups, it appears natural to fix like regulates stop of construction of the tree the constitution of pure groups from the point of view of the variable to be predicted. It is the case in our example above, none the sheets of the tree does not comprise counterexamples. Desirable in theory, this attitude is not bearable in practice.

Indeed, we often work on a sample which one hopes for representative of a population. All the stake is thus to find the good measure between collecting the useful information, really corresponding to a relation in the population, and introducing specificities of the file on which we are working (the sample known as of training), corresponding in fact to a statistical artefact. Said Autement, it never should be forgotten that the performance of the tree is evaluated on the same data which were used for its construction: the more complex the model is (the larger the tree is, the more it has branches, the more it has sheets), the more one runs the risk to see this model unable to be extrapolated with new data, i.e. to give an account of the reality which we precisely try to apprehend. In extreme cases, and it is particularly true decision trees, a model is able to retort any sample of data exactly, without to apprehend any reality … Indeed, so in an extreme case one decides to make further push our tree possible, we can obtain a tree made up of as many sheets as individuals in the sample training. Our tree then does not make any error on this sample since he marries all the characteristics of them: performance equalizes at 100%. Since one applies this model on new data which by nature do not have toutse the characteristics of our sample of training (it acts simply of another sample) its performance thus will be degraded for in extreme cases approaching 0%…!

Thus, when one builds a decision tree, one risks what one calls surajustement a model: the model seems powerful (its average error is very weak) but it is not it actually at all! It we will be supposed to find the tree smallest possible having the greatest possible performance. Plus one tree is small and more it will be stable in its future forecasts (in statistics, the principle of parsimony almost always prevails); the more powerful one tree is, the more it is satisfactory for the analyst. It is not used for nothing to generate a model of good quality very, if this quality is not constant and is degraded when one applies this model to a new whole of data.

In other words, to avoid an on-adjustment of our trees (and it is also true of all the mathematical models which one could build), it is advisable to apply a principle of parsimony and to carry out arbitrations performance/ complexity of the models used. With comparable performance, one will always prefer the simplest model, if one wishes to be able to use this model on new completely unknown data.

The problem of on-adjustment of the models

How to carry out this arbitration performance/ complexity of the models used? Initially, by evaluating the performance of one or several models on the data which were used for its construction (samples of training), but also on what one calls sample one (or several) of test, i.e. data at our disposal but which we voluntarily decide not to use in the construction of our models. All occurs as if these data of test were new data, reality. It is in particular the stability of the performance of our models on these two types of sample will enable us to judge his potential on-adjustment and thus of its capacity to being used with a risk of error controls under real conditions where the data are not known in advance…

In the graph opposite, we observe the evolution of the error of adjustment of a decision tree according to the number of sheets of the tree (complexity). We note that if the error decrease constantly on the sample of training, starting from a certain level of complexity (of very many branches and very many sheets) this model moves away from reality, reality which one tries to measure or in any case to approach on a second sample of data (sample test in the graph).

In the case of the decision trees, several types of algorithmic solutions were planned to try to avoid as much as possible a problem of potential on-adjustment of the models: it is of the techniques known as the pre one or post élégage of the trees.

Certain statistical theories (see work of the Russian mathematician Vladimir Vapnik) go even until having the aim of finding the optimum between the error made on the sample of training and that made on the sample of test. The theory of Vapnik Chervonenkis, “Structured Risk Minimization (SRM)”, by using a variable called “VC dimension”, provides a perfect mathematical modeling of the determination of the optimum of a model, usable consequently to generate models which ensure the best compromise between quality and robustness of the model.

In all the cases, these algorithmic solutions are complementary to performance analyzes compared and stability carried out on the samples of training and test.

Pre-pruning

The first strategy usable surajustement to avoid one massive of the decision trees consists in proposing criteria of stop at the time of the phase of expansion. It is the principle of pre-pruning . We consider for example that a segmentation is not necessary any more when the group is of manpower too weak; or, when the purity of a top reached a sufficient level, we consider that it is not necessary any more to segment it; another criterion often met within this framework, the use of a statistical test to evaluate if the segmentation introduces a significant contribution of information as for the prediction of the values of the variable to be predicted.

Post-pruning

The second strategy consists with to build the tree in two times : to produce the purest possible tree in a phase of expansion by using a first fraction of the sample of data (sample of training not to be confused with the totality of the sample, in English “growing set” is less ambiguous); then to carry out a step back to reduce the tree, it is the phase of post-pruning , while being based on another fraction of the data so as to optimize the performances of the tree. According to the software, this second portion of the data is indicated by the term of sample of validation or sample of test, introducing a confusion with the sample used to measure the performances of the models. The term which makes it possible to indicate it without ambiguity is “sample of pruning”, direct translation of Anglo-Saxon name “pruning set”.

Assignment of the conclusion on each sheet

Once the built tree, it is necessary to specify the rule of assignment in the sheets. A priori, if they are pure, the answer is obvious. If it is not the case, a simple rule is to decide like conclusion of the sheet the majority class, that which is represented.

This very simple technique is the optimal procedure within a very precise framework: the data result from a simple random pulling in the population; the matrix of the costs of bad assignment is unit (symmetrical) càd to affect advisedly does not cost anything (cost equal to 0), and to affect wrongly costs 1 whatever the case of figure. Apart from this groundwork, the rule of the majority is not justified. It remains nevertheless that it is a reference frame easy to use in practice.

Various possible algorithms

From these 4 points, there exists a very great number of alternatives. Some lay the stress on such or such aspect of the algorithm of construction, but there does not exist method which is in practice systematically more powerful. Here a nonexhaustive list of the algorithms classically used:

  • ID3
  • CHAID
  • CART
  • C4.5
  • C5
  • Exhaustive CHAID
  • QUEST

These algorithms are characterized by the criteria of segmentation used, by the implemented methods of élégages… but also by their manner of managing the given missing in our predictors. This point takes all its importance in practice. Indeed, the data at disposal in company for example are quasi systematically incomplete. What to make in the presence of missing data?

  • to be unaware of : that is possible that if sample of data is sufficiently large to remove individuals (lines of database), and that if one is sure that when the decision tree is used in practice (when it is deployed), all the data are always available for all the individuals.
  • to replace by a computed value considered to be adequate (one speaks about charge of missing values): this technique is sometimes used in the world of the statistics but beyond the purely mathematical problems that it can pose it runs up against a stone of methodological obstacle.
  • To use variable subsituts : that consists, for an individual who would have a missing data for a variable selected by the tree as being discriminating, to use the variable which among the whole of the variables available in the database locally produces the sheets most similar to the sheets produced by the variable whose data is missing. This variable is called a substitute. If an individual has a missing value for the initial variable, but also for the variable substitute, one can use a second variable substitute. And so on, until the limit of a quality standard of the substitute. This technique with the advantage of exploiting the whole of information available (that is thus very useful when this information is complex to recover but is unfortunately incomplete) for each individual.

The trees knew a Net renewed interest when the methods of aggregation of the classifieurs such as the Boosting or the bagging were developed and popularized in the community of the machine Learning. Some of the characteristics of the trees, in particular them very marked variability , enable them to benefit from the advantages of the combination of the predictors. Specific techniques were even developed to produce trees individually not very effective, but when they are combined, prove very powerful, it is the case in particular of Random Forests (forests of trees) of Breiman.

In processes of Data mining the decision trees are sometimes combined with more traditional techniques of modeling of a laid down objective: analyzes discriminating, logisitic regressions (logistic regression), linear regressions (linear regression), networks of neurons (will perceptrons multi-layer, radial basis function network…)… They also are very often combined between them to benefit from their respective advantages. Procedures of aggregation of the performances of the various models used, sometimes called procedures of vote, are then installation to obtain a maximum performance, while controlling and by minimizing the level of complexity of the whole of the models used.

Generalization with other problems

The decision trees such as they were presented here are dedicated to the supervised training where one tries to predict (to explain) the values taken by a discrete variable (To be sick or not, Répondre positively an promotional offer or not, etc) starting from a series of discriminating variables of unspecified type. Actually, the step can be wide with other types of problems.

When the target variable is continuous, we are located within the framework of the regression (the most popular method within this framework is the linear Regression). The same diagram of exploration can be applied, except that instead of optimizing the error rate, we optimize the quadratic error, and the criterion of segmentation becomes the decomposition of the variance: we choose the variable which maximizes the variance between classes. One speaks in this case about trees about regression, method CART, in his version RT (Tree Regression) is most known.

When we deal with several target variables, we are located in a problem of automatic Classification (or of typology). One speaks in this case about trees about classification (Clustering Tree in English). The criterion of evaluation of the partitions is the inertia, which is neither more nor less than one generalization of the variance within a multidimensional framework. The most known methods are due to Chavent (1998, the Methods Divisives Monothétiques) and Blockeel (1998, Predictive Clustering Tree). It seems that the typologies suggested by these techniques are also good, in terms of explained inertia, that those produced by the traditional approaches (Ascending hierarchical clustering, dynamic Nuées, etc), with the legibility of the results moreover.

References

  • L. Breiman, J. Friedman, R. Olshen, C. Stone: CART: Classification and Regression Trees , Wadsworth International, 1984.

  • R. Quinlan: C4.5: Programs for Machine Learning , Morgan Kaufmann Publishers Inc., 1993.
  • D. Zighed, R. Rakotomalala: Graphs of Induction -- Training and Dated Mining , Hermes, 2000.
  • Daniel T. Larose (French adaptation T. Vallaud): Of the data to knowledge: An introduction to dated-mining (1Cédérom), Vuibert, 2005.

External bonds

  • Decision trees, Tutoriel on line, Review MODULAD n°33.

  • Manual of statistics on line, (in English).

Software

All the principal commercial software of statistical analysis or Data Mining integrates a graphic module of construction of the decision trees. The offer is plethoric concerning the free software or free.
  • SPSS, Clementine, SAS, SAS Enterprise To mine, ALICE, SPAD, STATISTICA, SYSTAT

  • SIPINA, a free software of construction of the decision trees.
  • Bonds towards other software of induction of the decision trees.

Certain business packages of campaigns marketing and/or optimization of campaigns marketing (it is also the case of other functional problems) integrate in particular algorithms of construction of decision tree (thus very often that heuristic functional calculuses), in a completely transparent way, and make from there the use directly operational:

  • Neolane, Predictive Marketing, SAS Marketing Automation

Today, majority of the large commercial editors of relational databases (relational Database, Oracle (database), Microsoft SQL Server…) many algorithms of Data mining embark of which certain decision trees. The idea is to make it possible to the analysts to exploit all the computing power of these databases and the machines waiter on which they are installed.

Random links:Méricourt-in-Vimeu | Roger Bambuck | Foumban | Olivier Rouault | Saint-Exupéry, prince of the pilots

© 2007-2008 speedlook.com; article text available under the terms of GFDL, from fr.wikipedia.org