Logistic regression is a statistical technique that, starting from a set of observations, produces a model for predicting the values taken by a categorical variable, usually binary, from a series of continuous and/or binary explanatory variables.
Logistic regression is widely used in many fields. One example, among others:
In medicine, it makes it possible to identify the factors that characterize a group of sick subjects compared with healthy subjects.
The success of logistic regression rests in particular on the many tools available for interpreting the results in depth.
Compared with familiar regression techniques, linear regression in particular, logistic regression is distinguished mainly by the fact that the explained variable is categorical.
As a prediction method for a categorical variable, logistic regression is fully comparable to the supervised techniques of machine learning (decision trees, neural networks, etc.) and to predictive discriminant analysis in exploratory statistics. In particular, these methods can be put in competition in order to choose the model best suited to a given prediction problem.
In what follows, we denote by $Y$ the variable to be predicted (explained variable) and by $X = (X_1, X_2, \ldots, X_J)$ the predictive variables (explanatory variables).
In binary logistic regression, the variable $Y$ takes two possible values, $\{1, 0\}$. The variables $X_j$ are exclusively continuous or binary.
To carry out the estimation, we have a sample $\Omega$ of size $n$. We denote by $n_1$ (resp. $n_0$) the number of observations corresponding to the value $1$ (resp. $0$) of $Y$.
$P(Y = 1)$ (resp. $P(Y = 0)$) is the prior probability that $Y = 1$ (resp. $Y = 0$). To simplify, we will write $p(1)$ (resp. $p(0)$).
$p(X \mid 1)$ (resp. $p(X \mid 0)$) is the conditional distribution of $X$ given the value taken by $Y$.
Lastly, the posterior probability of obtaining the value $1$ (resp. $0$) of $Y$ given the value taken by $X$ is denoted $p(1 \mid X)$ (resp. $p(0 \mid X)$).
Logistic regression rests on the following fundamental assumption:
$$\ln \frac{p(X \mid 1)}{p(X \mid 0)} = a_0 + a_1 x_1 + \cdots + a_J x_J$$
A wide class of distributions satisfies this specification: the multivariate normal distribution used in linear discriminant analysis, for example, but also other distributions, in particular those where the explanatory variables are dichotomous (0/1).
Again in comparison with discriminant analysis, it is no longer the conditional densities $p(X \mid 1)$ and $p(X \mid 0)$ that are modelled but the ratio of these densities. The restriction introduced by the assumption is weaker.
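The specification above can be written in a different way. The term LOGIT of $p(1 \mid X)$ designates the following expression:
$$\ln \frac{p(1 \mid X)}{1 - p(1 \mid X)} = b_0 + b_1 x_1 + \cdots + b_J x_J$$
It is indeed a "regression", because the aim is to show a dependence relation between a variable to be explained and a series of explanatory variables.
It is also "logistic": solving the equation above for $p(1 \mid X)$ gives the logistic function
$$p(1 \mid X) = \frac{e^{b_0 + b_1 x_1 + \cdots + b_J x_J}}{1 + e^{b_0 + b_1 x_1 + \cdots + b_J x_J}}$$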
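The passage from the LOGIT to the posterior probability can be sketched in a few lines of Python. This is only an illustration: the function names `logit`, `logistic` and `posterior` are chosen here for the example, and the coefficient list `b = [b0, b1, ..., bJ]` is assumed to be given.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: e^z / (1 + e^z), written as 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def posterior(b, x):
    """p(1 | X = x) for coefficients b = [b0, b1, ..., bJ] (assumed known)."""
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    return logistic(z)
```

Applying `logit` to the output of `logistic` returns the original linear score, which is the sense in which the two expressions of the model are equivalent.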
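Note: equivalence of the expressions
We started from two different expressions to arrive at the logistic model. We observe here the correspondence between the coefficients $a_j$ and $b_j$. Let us return to the LOGIT and apply Bayes' theorem:
$$\ln \frac{p(1 \mid X)}{1 - p(1 \mid X)} = \ln \frac{p(1)\, p(X \mid 1)}{p(0)\, p(X \mid 0)} = \ln \frac{p(1)}{p(0)} + \ln \frac{p(X \mid 1)}{p(X \mid 0)}$$
We note that
$$b_0 = \ln \frac{p(1)}{p(0)} + a_0, \qquad b_j = a_j \quad (j \geq 1)$$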
Starting from a data file, we must estimate the coefficients $b_j$ of the LOGIT function. It is very rare to have, for each possible combination of the $X_j$, even when these variables are all binary, enough observations to obtain a reliable estimate of the probabilities $p(1 \mid X)$ and $p(0 \mid X)$. The ordinary least squares method is ruled out. The solution is a different approach: maximization of the likelihood.
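The probability that an individual $\omega$ belongs to a group, which can also be seen as a contribution to the likelihood, can be written as follows:
$$p(1 \mid X(\omega))^{Y(\omega)} \times [1 - p(1 \mid X(\omega))]^{1 - Y(\omega)}$$
The likelihood of a sample $\Omega$ is then written:
$$L = \prod_{\omega \in \Omega} p(1 \mid X(\omega))^{Y(\omega)} \times [1 - p(1 \mid X(\omega))]^{1 - Y(\omega)}$$
The parameters $b_j$ that maximize this quantity are the maximum likelihood estimators of the logistic regression.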
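In practice it is the logarithm of the likelihood that is computed, since a sum is numerically more convenient than a product. A minimal sketch, assuming observations are stored as a list of rows `X` (without the constant) and a list `y` of 0/1 labels; the function name `log_likelihood` is an assumption of this example.

```python
import math

def log_likelihood(b, X, y):
    """Log-likelihood of coefficients b = [b0, b1, ..., bJ] on the sample (X, y)."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
        p = 1.0 / (1.0 + math.exp(-z))          # p(1 | X) for this observation
        # Contribution: y * ln(p) + (1 - y) * ln(1 - p)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total
```

With all coefficients at zero every observation has probability 0.5, so the log-likelihood equals n times ln(0.5); any useful model should do better than that.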
In practice, software uses an approximate procedure to obtain a satisfactory solution to the maximization above. This also explains why different programs do not always return strictly identical coefficients: the results depend on the algorithm used and on the precision set in the computation parameters.
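In what follows, we denote by $\beta$ the vector of parameters to be estimated and by $L$ the likelihood. The best-known procedure is the Newton-Raphson method, an iterative gradient method (see optimization algorithm). It is based on the following relation:
$$\beta^{i+1} = \beta^{i} - \left( \frac{\partial^2 \ln L}{\partial \beta \, \partial \beta'} \right)^{-1} \times \frac{\partial \ln L}{\partial \beta}$$
$\beta^{i}$ is the current solution at step $i$; $\beta^{0} = (0, \ldots, 0)$ is a possible initialization;
$\frac{\partial \ln L}{\partial \beta}$ is the vector of first partial derivatives of the log-likelihood;
$\frac{\partial^2 \ln L}{\partial \beta \, \partial \beta'}$ is the matrix of second partial derivatives of the log-likelihood.
This last matrix, known as the Hessian matrix, is of interest because its inverse, with the sign changed, provides the estimate of the variance-covariance matrix of $\beta$. It is used in the various hypothesis tests that assess the significance of the coefficients.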
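The iteration can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the procedure of any particular software: the names `solve` and `newton_raphson`, the tolerance `tol` and the iteration cap `max_iter` are choices made for this example, and the linear system is solved with a small Gaussian elimination rather than an explicit matrix inverse.

```python
import math

def solve(A, v):
    """Solve A u = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u

def newton_raphson(X, y, tol=1e-8, max_iter=25):
    """Maximum likelihood coefficients [b0, b1, ..., bJ]; X rows exclude the constant."""
    rows = [[1.0] + list(xi) for xi in X]        # prepend the constant term
    d = len(rows[0])
    beta = [0.0] * d                             # initialization at the zero vector
    for _ in range(max_iter):
        grad = [0.0] * d                         # first derivatives of ln L
        hess = [[0.0] * d for _ in range(d)]     # second derivatives of ln L
        for r, yi in zip(rows, y):
            z = sum(bj * rj for bj, rj in zip(beta, r))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (yi - p) * r[j]
                for k in range(d):
                    hess[j][k] -= p * (1.0 - p) * r[j] * r[k]
        step = solve(hess, grad)                 # Hessian^{-1} times gradient
        beta = [bj - sj for bj, sj in zip(beta, step)]
        if max(abs(sj) for sj in step) < tol:    # successive solutions nearly equal
            break
    return beta
```

The sketch assumes the data are not perfectly separable; otherwise the coefficients diverge and the Hessian becomes numerically singular.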
Since the objective is to produce a model that predicts as accurately as possible the values taken by a categorical variable $Y$, a preferred way of assessing the quality of the model is to compare the predicted values with the true values taken by $Y$: this is the role of the confusion matrix. A simple indicator is then derived from it, the error rate or misclassification rate, which is the ratio of the number of wrong predictions to the sample size.
When the confusion matrix is built on the data that were used to fit the model, the error rate is often too optimistic and does not reflect the real performance of the model in the population. For the evaluation to be unbiased, it is advisable to build this matrix on a separate sample, called the test sample. Unlike the training sample, it plays no part in building the model.
The main advantage of this method is that it allows any classification methods to be compared, so that the one that performs best on a given problem can be selected.
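A confusion matrix and its error rate are straightforward to compute. The sketch below assumes labels coded 0/1 and uses illustrative function names; it should be applied to predictions made on the test sample, not on the training sample.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true classes, columns predicted classes."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def error_rate(y_true, y_pred):
    """Number of wrong predictions divided by the sample size."""
    m = confusion_matrix(y_true, y_pred)
    wrong = m[0][1] + m[1][0]                    # off-diagonal cells
    return wrong / len(y_true)
```

Because the computation only needs true and predicted labels, the same functions can compare logistic regression with any other classifier.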
A probabilistic framework can be used to carry out hypothesis tests on the validity of the model. These tests rely on the asymptotic distribution of the maximum likelihood estimators.
To check the overall significance of the model, we can use a test analogous to the one used to evaluate multiple linear regression. The null hypothesis is $H_0: b_1 = b_2 = \cdots = b_J = 0$, which is set against the alternative hypothesis $H_1$: at least one of the coefficients is non-zero.
The likelihood ratio statistic is written $\Lambda = 2 \times [\ell(J+1) - \ell(1)]$; it follows a $\chi^2$ distribution with $J$ degrees of freedom.
$\ell(J+1)$ is the logarithm of the likelihood of the model with all the variables (thus J+1 coefficients, counting the constant), and $\ell(1)$ is the logarithm of the likelihood of the model reduced to the constant alone.
If the critical probability (the p-value) is lower than the chosen significance level, the model can be considered globally significant. It remains to be seen which variables really play a part in this relation.
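The overall test can be sketched as follows, assuming the two log-likelihoods have already been computed. The helper `chi2_sf` is written for this example from the closed-form series of the chi-squared tail for an integer number of degrees of freedom; in real work a statistics library would normally supply it.

```python
import math

def chi2_sf(x, df):
    """P(chi2 with df degrees of freedom > x), for integer df >= 1 and x >= 0."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k < df/2} y^k / k!
        return math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{k=1..m} y^(k - 1/2) / Gamma(k + 1/2)
    m = (df - 1) // 2
    tail = sum(y ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def likelihood_ratio_test(ll_full, ll_const, J):
    """Statistic 2 * (ll_full - ll_const) and its p-value with J degrees of freedom."""
    stat = 2.0 * (ll_full - ll_const)
    return stat, chi2_sf(stat, J)
```

The model is declared globally significant when the returned p-value falls below the chosen significance level.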
To test the significance of a single variable, we carry out the following test: $H_0: b_j = 0$ against $H_1: b_j \neq 0$.
The Wald statistic addresses this test; it is written $W = \frac{\hat{b}_j^2}{\hat{V}(\hat{b}_j)}$ and follows a $\chi^2$ distribution with $1$ degree of freedom.
N.B.: The estimated variance $\hat{V}(\hat{b}_j)$ of the coefficient is read from the inverse of the Hessian matrix seen previously.
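A short sketch of the Wald test for one coefficient, assuming the estimate and its estimated variance are available. The function name `wald_test` is illustrative; the p-value uses the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable.

```python
import math

def wald_test(b_hat, var_hat):
    """Wald statistic b_hat^2 / var_hat and its chi-squared (1 df) p-value."""
    w = b_hat ** 2 / var_hat
    p_value = math.erfc(math.sqrt(w / 2.0))      # P(chi2_1 > w)
    return w, p_value
```

For example, a coefficient equal to 1.96 times its standard error gives a p-value very close to 0.05.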
The two tests above are special cases of the significance test for a block of coefficients. They derive from the "deviance" criterion, which compares the likelihood of the model under study with that of the saturated model (the model containing all the parameters).
The null hypothesis in this case is $H_0: b_j = b_{j+1} = \cdots = b_{j+q-1} = 0$, where $b_j, \ldots, b_{j+q-1}$ is a set of $q$ coefficients simultaneously equal to zero.
The test statistic follows a $\chi^2$ distribution with $q$ degrees of freedom.
This test can be very useful when we want to test the role of a categorical explanatory variable with $m$ categories in the model. After recoding, we actually introduce $m - 1$ indicator variables into the model. To evaluate the role of the categorical variable taken as a whole, whichever category is considered, we must simultaneously test the $m - 1$ coefficients associated with the indicator variables.
Other evaluation procedures are commonly cited for logistic regression. These include the Hosmer-Lemeshow test, which uses the "score" (the probability of assignment to a group) to order the observations. In this respect it resembles other evaluation tools such as ROC curves, which carry much more information than the simple confusion matrix and its associated error rate.
Using the data available on the site of the online logistic regression course (Paul-Marie Bernard, University of Quebec, Chapter 5), we built a prediction model that aims to explain the "low weight (yes/no)" of a baby at birth. The explanatory variables are: SMOKE (whether or not the mother smoked during pregnancy), PREM (history of premature births in previous deliveries), HT (history of hypertension), VISIT (number of visits to the doctor during the first trimester of pregnancy), OLD (age of the mother), PDSM (weight of the mother at the time of the last menstrual period), SCOL (mother's level of schooling: =1: <12 years, =2: 12-15 years, =3: >15 years).
All the explanatory variables were treated as continuous in this analysis. In some cases, SCOL for example, it might be wiser to code them as indicator variables.
The results are reported in the following table.
This first analysis can be refined by carrying out a selection of variables, by studying the concomitant role of certain variables, etc. The success of logistic regression rests largely on the many interpretation tools it offers. With the notions of odds, odds ratios and relative risk, computed on dichotomous or continuous variables or on combinations of variables, the statistician can analyze causal relationships in detail and highlight the factors that really weigh on the variable to be explained.
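As an illustration of these interpretation tools, the odds ratio associated with a variable is the exponential of its coefficient in the LOGIT. A minimal sketch, in which the function name `odds_ratio` and the `delta` parameter (the size of the change in the variable) are assumptions of this example:

```python
import math

def odds_ratio(b_j, delta=1.0):
    """Factor by which the odds p/(1-p) are multiplied when x_j increases by delta."""
    return math.exp(b_j * delta)
```

For a binary variable, `delta=1.0` compares the two groups; for a continuous variable such as an age, a larger `delta` (ten years, say) often gives a more readable figure.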
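To classify a new individual $\omega$, we apply Bayes' rule:
$Y(\omega) = 1$ if $p(1 \mid X) > p(0 \mid X)$
which is equivalent to
$Y(\omega) = 1$ if $p(1 \mid X) > 0.5$
If we consider the LOGIT function, this procedure amounts to the following assignment rule:
$Y(\omega) = 1$ if $b_0 + b_1 x_1 + \cdots + b_J x_J > 0$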
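The equivalence between the probability threshold and the sign of the LOGIT can be checked with a small sketch. The function names and any coefficient values passed to them are hypothetical; they are not the coefficients of the low birth weight model.

```python
import math

def linear_score(b, x):
    """LOGIT value b0 + b1*x1 + ... + bJ*xJ."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def classify_by_logit(b, x):
    """Assign class 1 when the LOGIT is strictly positive."""
    return 1 if linear_score(b, x) > 0 else 0

def classify_by_probability(b, x):
    """Assign class 1 when p(1 | X) exceeds 0.5."""
    p = 1.0 / (1.0 + math.exp(-linear_score(b, x)))
    return 1 if p > 0.5 else 0
```

Both functions return the same class for any input, since the logistic function is increasing and equals 0.5 exactly when the LOGIT is zero.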
Let us take the following observation: $\omega$ = (SMOKE = 1 "yes"; PREM = 1 "one premature birth in the mother's history"; HT = 0 "no"; VISIT = 0 "no visit to the doctor during the first trimester of pregnancy"; OLD = 28; PDSM = 54.55; SCOL = 2 "between 12 and 15 years").
Applying the assignment rule above to this observation, the model predicts a low-weight baby for this person.
This is borne out, since the observation is n°131 in our file, and it did indeed result in the birth of a low-weight child.
The assignment rule above is valid if the sample results from random sampling in the population. This is not always the case. In many fields, the class sizes $n_1$ and $n_0$ are fixed beforehand, and the data are then collected within each group. This is known as retrospective sampling. A correction is then necessary. While the coefficients associated with the variables in the logit function are unchanged, the constant must be corrected to take account of the size of each class ($n_1$ and $n_0$) and of the true prior probabilities $p(1)$ and $p(0)$ (cf. references below).
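The text does not give the correction formula. One commonly used form, shown here as an assumption rather than as the formula of the cited references, subtracts the log-ratio of the class sizes in the sample and adds the log-ratio of the true prior probabilities. The function and parameter names are illustrative.

```python
import math

def corrected_intercept(b0_sample, n1, n0, p1):
    """Intercept adjusted for retrospective sampling (assumed prior correction).

    b0_sample: constant estimated on the retrospective sample
    n1, n0:    class sizes fixed in the sample
    p1:        true prior probability p(1) in the population; p(0) = 1 - p1
    """
    p0 = 1.0 - p1
    return b0_sample - math.log(n1 / n0) + math.log(p1 / p0)
```

When the sample proportions already match the population priors, the two logarithms cancel and the constant is left unchanged.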
Logistic regression applies directly when the explanatory variables are continuous or dichotomous. When they are categorical, recoding is necessary. The simplest is binary coding. Take the example of a variable habitat with three categories {city, periphery, others}. We then create two binary variables: "habitat_ville" and "habitat_periphery". The last category is deduced from the other two: when both variables are simultaneously 0, the observation corresponds to "habitat = others".
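Binary coding with a reference category can be sketched as below. The function name `dummy_code` and its parameters are assumptions of this example, and the generated column names are built from the category labels, so they differ slightly from the names quoted in the text.

```python
def dummy_code(value, levels, reference, prefix):
    """Return one 0/1 indicator per level, leaving out the reference level."""
    return {
        f"{prefix}_{level}": int(value == level)
        for level in levels
        if level != reference
    }

# Example with the habitat variable, taking "others" as the reference category.
levels = ["city", "periphery", "others"]
coded = dummy_code("periphery", levels, "others", "habitat")
```

An observation in the reference category is coded with all indicators equal to 0, which is why only two columns are needed for three categories.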
Lastly, logistic regression can be used to predict the values of a categorical variable with K (K > 2) categories. This is called polytomous logistic regression. The procedure relies on designating a reference group; it then produces (K-1) linear combinations for prediction. The interpretation of the coefficients is less straightforward in this case.
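The role of the reference group can be illustrated with a short sketch: each of the K-1 non-reference categories has its own linear combination, and the reference category is given a score of zero. The function name and the layout of `coefs` are assumptions of this example.

```python
import math

def multinomial_probabilities(coefs, x):
    """Class probabilities for K categories, the reference category last.

    coefs: list of K-1 coefficient vectors [b0, b1, ..., bJ], one per
           non-reference category (hypothetical values).
    """
    scores = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in coefs]
    scores.append(0.0)                           # reference category
    top = max(scores)                            # subtracted for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With K = 2 this reduces to the binary model: the single linear combination is the LOGIT and the first probability is the logistic function of it.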
M. Bardos, Analyse discriminante - Application au risque et scoring financier, Dunod, 2001 (chapter 3).
TANAGRA, free software for teaching and research.