Logistic regression is a statistical technique that, starting from a set of observations, produces a model for predicting the values taken by a categorical variable, usually binary, from a set of continuous and/or binary explanatory variables.

Logistic regression is widely used in many fields. A non-exhaustive list:

  • In medicine, it can be used, for example, to find the factors that characterize a group of sick subjects compared with healthy subjects.
  • In insurance, it can be used to target the fraction of customers likely to be receptive to an insurance policy covering a particular risk.
  • In banking, to detect at-risk groups when a loan is taken out.
  • In econometrics, to explain a discrete variable, for example voting intentions in elections.

The success of logistic regression rests in particular on the many tools available for interpreting its results in depth.

Compared with other regression techniques, linear regression in particular, logistic regression is distinguished mainly by the fact that the response variable is categorical.

As a prediction method for categorical variables, logistic regression is directly comparable to the supervised techniques offered in machine learning (decision trees, neural networks, etc.) and to predictive discriminant analysis in exploratory statistics. These methods can be put in competition in order to choose the model best suited to a given prediction problem.

Notation, assumptions and estimation

Notation

In what follows, $Y$ denotes the variable to be predicted (the response variable) and $X = (X_1, X_2, \ldots, X_J)$ the predictor (explanatory) variables.

In binary logistic regression, $Y$ takes two possible values, $\{1, 0\}$. The variables $X_j$ are exclusively continuous or binary.

  • To carry out the estimation, we have a sample $\Omega$ of size $n$. We denote by $n_1$ (resp. $n_0$) the number of observations for which $Y$ takes the value $1$ (resp. $0$).

  • $P(Y=1)$ (resp. $P(Y=0)$) is the prior probability that $Y=1$ (resp. $Y=0$). To simplify, we write $p(1)$ (resp. $p(0)$).

  • $p(X/1)$ (resp. $p(X/0)$) is the conditional distribution of $X$ given the value taken by $Y$.

  • Lastly, the posterior probability of obtaining the value $1$ (resp. $0$) of $Y$ given the value taken by $X$ is denoted $p(1/X)$ (resp. $p(0/X)$).

Fundamental assumption

Logistic regression rests on the following fundamental assumption:

$$\ln \frac{p(X/1)}{p(X/0)} = a_0 + a_1 x_1 + \ldots + a_J x_J$$

A broad class of distributions satisfies this specification: the multivariate normal distribution used in linear discriminant analysis, for example, but also other distributions, in particular those where the explanatory variables are dichotomous (0/1).

Again compared with discriminant analysis, it is no longer the conditional densities $p(X/1)$ and $p(X/0)$ that are modelled but the ratio of these densities. The restriction introduced by the assumption is weaker.

The LOGIT model

The specification above can be written in a different way. The term LOGIT of $p(1/X)$ denotes the following expression:

$$\ln \frac{p(1/X)}{1-p(1/X)} = b_0 + b_1 x_1 + \ldots + b_J x_J$$

  • It is indeed a "regression" because the aim is to show a dependence relation between a variable to be explained and a set of explanatory variables.

  • It is a "logistic" regression because the probability distribution is modelled with a logistic distribution.

Indeed, after transforming the equation above, we obtain

$$p(1/X) = \frac{e^{b_0 + b_1 x_1 + \ldots + b_J x_J}}{1 + e^{b_0 + b_1 x_1 + \ldots + b_J x_J}}$$
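The two forms of the model (the LOGIT and the probability it implies) can be illustrated with a minimal Python sketch. The function names, the coefficients and the observation are hypothetical and serve only as an illustration.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_proba(b, x):
    """p(1/X) for coefficients b = [b_0, b_1, ..., b_J] and values x = [x_1, ..., x_J]."""
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    # Numerically stable evaluation of e^z / (1 + e^z)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

b = [-1.0, 0.5, 2.0]     # hypothetical coefficients b_0, b_1, b_2
x = [2.0, 0.0]           # hypothetical observation
p = predict_proba(b, x)  # linear part: -1 + 0.5 * 2 + 2 * 0 = 0
```

Applying `logit` to the output of `predict_proba` recovers the linear combination, which is the sense in which the two expressions are equivalent.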

Note: equivalence of the expressions

We started from two different expressions to arrive at the logistic model. Here we look at the correspondence between the coefficients $a_j$ and $b_j$. Let us return to the LOGIT and apply Bayes' theorem:

$$\ln \frac{p(1/X)}{1-p(1/X)} = \ln \frac{p(1/X)}{p(0/X)} = \ln \frac{p(1)\,p(X/1)}{p(0)\,p(X/0)} = \ln \frac{p(1)}{p(0)} + \ln \frac{p(X/1)}{p(X/0)}$$

Substituting the fundamental assumption gives

$$\ln \frac{p(1/X)}{1-p(1/X)} = \ln \frac{p(1)}{p(0)} + a_0 + a_1 x_1 + \ldots + a_J x_J$$

We note that

$$\begin{cases} b_0 = \ln \frac{p(1)}{p(0)} + a_0 \\ b_j = a_j, & j \ge 1 \end{cases}$$
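The relation between the $a_j$ and the $b_j$ can be checked numerically. The sketch below assumes a single Gaussian predictor with the same variance in both groups (a case that satisfies the fundamental assumption); all numerical values are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical setting: group means, common standard deviation, prior probabilities
mu1, mu0, sigma = 2.0, 0.0, 1.5
p1, p0 = 0.3, 0.7

# Coefficients of ln[p(x/1) / p(x/0)] = a_0 + a_1 x for two equal-variance Gaussians
a1 = (mu1 - mu0) / sigma ** 2
a0 = (mu0 ** 2 - mu1 ** 2) / (2.0 * sigma ** 2)

# Coefficients of the LOGIT: b_0 = ln(p(1)/p(0)) + a_0 and b_1 = a_1
b0 = math.log(p1 / p0) + a0
b1 = a1

def posterior_bayes(x):
    """p(1/x) computed directly from Bayes' theorem."""
    num = p1 * normal_pdf(x, mu1, sigma)
    return num / (num + p0 * normal_pdf(x, mu0, sigma))

def posterior_logit(x):
    """p(1/x) computed from the logistic model with coefficients b_0, b_1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Largest disagreement between the two computations on a few test points
max_gap = max(abs(posterior_bayes(x) - posterior_logit(x)) for x in [-3.0, 0.0, 1.0, 4.0])
```

The two posteriors agree up to floating-point error, so `max_gap` is negligible.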

Estimation - The maximum likelihood principle

Starting from a data file, we must estimate the coefficients $b_j$ of the LOGIT function. It is very rare to have, for each possible combination of the $X_j$ $(j = 1, \ldots, J)$, even when these variables are all binary, enough observations to obtain a reliable estimate of the probabilities $P(1/X)$ and $P(0/X)$. Ordinary least squares is ruled out. The solution is another approach: maximum likelihood.

The probability that an individual $\omega$ belongs to a group, which can also be seen as its contribution to the likelihood, can be written as

$$P(Y(\omega)=1/X(\omega))^{Y(\omega)} \times [1 - P(Y(\omega)=1/X(\omega))]^{1 - Y(\omega)}$$

The likelihood of a sample $\Omega$ is then:

$$L = \prod_{\omega} P(Y(\omega)=1/X(\omega))^{Y(\omega)} \times [1 - P(Y(\omega)=1/X(\omega))]^{1 - Y(\omega)}$$

The parameters $\hat b_j$ $(j = 0, \ldots, J)$ that maximize this quantity are the maximum likelihood estimators of the logistic regression.
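In practice the logarithm of the likelihood is usually maximized, since it turns the product into a sum and has the same maximizer. A minimal Python sketch of this log-likelihood follows; the function names and the small data set are hypothetical.

```python
import math

def softplus(t):
    """ln(1 + e^t), evaluated without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def log_likelihood(b, X, y):
    """ln L for coefficients b = [b_0, ..., b_J], rows X[i] = [x_1, ..., x_J], labels y[i] in {0, 1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
        # ln p = -softplus(-z) and ln(1 - p) = -softplus(z), with p = e^z / (1 + e^z)
        total -= yi * softplus(-z) + (1 - yi) * softplus(z)
    return total

X = [[0.0], [1.0], [2.0], [3.0]]  # hypothetical observations, one predictor
y = [0, 0, 1, 1]
ll_zero = log_likelihood([0.0, 0.0], X, y)    # every p equals 0.5
ll_better = log_likelihood([-1.5, 1.0], X, y)
```

With all coefficients at zero every observation contributes ln(0.5); coefficients that fit the data better give a higher (less negative) value.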

Estimation in practice

In practice, software uses an approximate procedure to obtain a satisfactory solution to the maximization above. This explains why different programs do not always return strictly identical coefficients. The results depend on the algorithm used and on the precision chosen when configuring the computation.

In what follows, $\beta$ denotes the vector of parameters to be estimated. The best-known procedure is the Newton-Raphson method, an iterative gradient-based method (see Optimization algorithm). It is based on the following relation:

$$\beta^{i+1} = \beta^{i} - \left(\frac{\partial^2 L}{\partial \beta \, \partial \beta'}\right)^{-1} \times \frac{\partial L}{\partial \beta}$$

  • $\beta^{i}$ is the current solution at step $i$; $\beta^{0} = (0, \ldots, 0)$ is a possible initialization;
  • $\frac{\partial L}{\partial \beta}$ is the vector of first-order partial derivatives of the likelihood;
  • $\frac{\partial^2 L}{\partial \beta \, \partial \beta'}$ is the matrix of second-order partial derivatives of the likelihood;
  • the iterations stop when the difference between two successive solution vectors is negligible.

This last matrix, known as the Hessian matrix, is of interest because its inverse gives the estimate of the variance-covariance matrix of $\beta$. It is used in the various hypothesis tests to assess the significance of the coefficients.
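The iteration can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the routine of any particular software: it works on the log-likelihood (the usual choice), each row of `X` starts with a 1 for the constant, and the data set and the names `solve` and `newton_raphson` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def solve(A, v):
    """Solve A u = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][k] * u[k] for k in range(r + 1, n))) / M[r][r]
    return u

def newton_raphson(X, y, tol=1e-8, max_iter=25):
    """Return the coefficient vector that maximizes the log-likelihood."""
    d = len(X[0])
    beta = [0.0] * d                          # initialization beta^0 = (0, ..., 0)
    for _ in range(max_iter):
        grad = [0.0] * d                      # first derivatives
        hess = [[0.0] * d for _ in range(d)]  # second derivatives (Hessian)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j in range(d):
                grad[j] += (yi - p) * xi[j]
                for k in range(d):
                    hess[j][k] -= p * (1.0 - p) * xi[j] * xi[k]
        step = solve(hess, grad)              # Hessian^{-1} times gradient
        beta = [b - s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:   # successive solutions are nearly identical
            break
    return beta

# Hypothetical data: a constant column and one predictor, classes not perfectly separable
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]
beta_hat = newton_raphson(X, y)
```

The data are deliberately not perfectly separable: with perfectly separated classes the maximum likelihood estimate does not exist and the iterations would diverge.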

Evaluation

Confusion matrix

Since the objective is to produce a model that predicts the values taken by a categorical variable $Y$ as accurately as possible, a natural way to evaluate the quality of the model is to compare the predicted values with the true values taken by $Y$: this is the role of the confusion matrix. From it we derive a simple indicator, the error rate or misclassification rate, which is the ratio of the number of wrong predictions to the sample size.

When the confusion matrix is built on the data that were used to fit the model, the error rate is often too optimistic and does not reflect the real performance of the model in the population. To avoid a biased evaluation, it is advisable to build this matrix on a separate sample, called the test sample. Unlike the training sample, it takes no part in building the model.

The main advantage of this method is that it makes it possible to compare any classification methods and thus to select the one that performs best on a given problem.
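As an illustration, the confusion matrix and the error rate can be computed as in the following Python sketch. The labels and predictions are hypothetical and stand for a test sample.

```python
def confusion_matrix(y_true, y_pred):
    """Counts indexed by (actual value, predicted value) for binary labels 0/1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(y_true, y_pred):
        counts[(a, p)] += 1
    return counts

def error_rate(counts):
    """Number of wrong predictions divided by the sample size."""
    wrong = counts[(0, 1)] + counts[(1, 0)]
    return wrong / sum(counts.values())

y_test = [1, 0, 0, 1, 1, 0, 0, 1]  # hypothetical true values on a test sample
y_hat = [1, 0, 1, 1, 0, 0, 0, 1]   # hypothetical predictions of a model
cm = confusion_matrix(y_test, y_hat)
rate = error_rate(cm)
```

Because the computation only needs the true and predicted labels, the same code applies to the predictions of any classification method, which is what makes the comparison between methods possible.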

Statistical evaluation of the regression

A probabilistic framework can be used to carry out hypothesis tests on the validity of the model. These tests rest on the asymptotic distribution of the maximum likelihood estimators.

To check the overall significance of the model, we can use a test analogous to the one used for multiple linear regression. The null hypothesis is $H_0: b_1 = b_2 = \ldots = b_J = 0$, against the alternative hypothesis $H_1$: at least one of the coefficients is nonzero.

The likelihood ratio statistic is $\Lambda = 2 \times [l(J+1) - l(1)]$; it follows a $\chi^2$ distribution with $J$ degrees of freedom.

  • $l(J+1)$ is the log-likelihood of the model with all the variables (hence $J+1$ coefficients, counting the constant), and

  • $l(1)$ is the log-likelihood of the model reduced to the constant alone.

If the p-value is lower than the chosen significance level, the model can be considered globally significant. It remains to be seen which variables actually play a role in this relation.
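The test can be sketched in Python as follows. The upper-tail probability of the $\chi^2$ distribution is computed here with the classical finite-sum formulas for an integer number of degrees of freedom; the log-likelihood values and the function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2 > x) for an integer df >= 1 and x >= 0."""
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite sum of x^{(2r-1)/2} / (1 * 3 * ... * (2r-1))
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * phi * total

def likelihood_ratio_test(ll_full, ll_constant, n_variables):
    """Return the statistic 2 * (ll_full - ll_constant) and its p-value."""
    lam = 2.0 * (ll_full - ll_constant)
    return lam, chi2_sf(lam, n_variables)

# Hypothetical log-likelihoods for a model with 3 explanatory variables
lam, p_value = likelihood_ratio_test(ll_full=-110.0, ll_constant=-120.0, n_variables=3)
significant = p_value < 0.05
```

The number of degrees of freedom is the number of explanatory variables, that is, the difference in the number of coefficients between the two models.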

Individual evaluation of the coefficients

To test whether a single variable plays a significant role, we carry out the following test: $H_0: b_j = 0$ against $H_1: b_j \neq 0$.

The Wald statistic is used for this test. It is written $W = \frac{\hat b_j^2}{\hat V(\hat b_j)}$ and follows a $\chi^2$ distribution with $1$ degree of freedom.

N.B.: The estimated variance of the coefficient $\hat b_j$ is read from the inverse of the Hessian matrix seen previously.
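A minimal Python sketch of the Wald test follows. The coefficient and its estimated variance are hypothetical; for 1 degree of freedom the upper tail of the $\chi^2$ distribution reduces to the complementary error function.

```python
import math

def wald_test(b_hat, var_hat):
    """Return the Wald statistic b_hat^2 / var_hat and its p-value (chi2, 1 degree of freedom)."""
    w = b_hat ** 2 / var_hat
    p_value = math.erfc(math.sqrt(w / 2.0))  # P(chi2_1 > w)
    return w, p_value

# Hypothetical estimate: coefficient 0.9 with standard error 0.45 (variance 0.2025)
w, p_value = wald_test(0.9, 0.2025)
```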

Evaluation of a block of coefficients

The two tests above are special cases of the significance test for a block of coefficients. They derive from the "deviance" criterion, which compares the likelihood of the model under study with that of the saturated model (the model in which we have all the parameters).

The null hypothesis is in this case $H_0: \beta_{(q)} = 0$, where $\beta_{(q)}$ denotes a set of $q$ coefficients that are simultaneously zero.

The test statistic $W_{(q)} = 2 \times [l(J+1) - l(J+1-q)]$, where $l(J+1)$ is the log-likelihood of the full model and $l(J+1-q)$ that of the model without the $q$ coefficients being tested, follows a $\chi^2$ distribution with $q$ degrees of freedom.

This test can be very useful when we want to test the role of a categorical explanatory variable with $q + 1$ categories in the model. After recoding, we introduce $q$ indicator variables into the model. To evaluate the role of the categorical variable taken as a whole, whatever the category considered, we must test the coefficients associated with the indicator variables simultaneously.

Other evaluations

Other evaluation procedures are commonly cited for logistic regression. Among them is the Hosmer-Lemeshow test, which relies on the "score" (the probability of assignment to a group) to order the observations. In this respect it resembles other evaluation tools from machine learning such as ROC curves, which carry considerably more information than the confusion matrix and its associated error rate alone.
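As an illustration of how the score orders the observations, the area under the ROC curve can be computed as the proportion of (positive, negative) pairs in which the positive observation has the higher score. The sketch below uses hypothetical labels and scores.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve from pairwise comparisons (ties count for one half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]            # hypothetical true values
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical scores p(1/X) from a model
auc = roc_auc(y_true, scores)
```

A value of 0.5 corresponds to a score that orders the observations no better than chance, and 1 to a score that ranks every positive above every negative.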

An example

Using the data available on the site of the online logistic regression course (Paul-Marie Bernard, University of Quebec, Chapter 5), we built a prediction model that aims to explain the "Low birth weight (Yes/No)" of a baby. The explanatory variables are: SMOKE (whether or not the mother smoked during pregnancy), PREM (history of premature births in previous deliveries), HT (history of hypertension), VISIT (number of visits to the doctor during the first trimester of pregnancy), OLD (age of the mother), PDSM (weight of the mother at the time of her last menstrual period), SCOL (level of schooling of the mother: =1: <12 years, =2: 12-15 years, =3: >15 years).

All the explanatory variables were treated as continuous in this analysis. In some cases, SCOL for example, it might be wiser to code them as indicator variables.

Reading the results

The results are reported in the following table.

  • In the confusion matrix, we read that on the training data the prediction model makes 10 + 39 = 49 wrong predictions. The resubstitution error rate is 49/190 = 25.79%.
  • The likelihood ratio statistic LAMBDA is equal to 31.77; the associated p-value is practically 0. The model is therefore globally highly significant: there is indeed a relation between the explanatory variables and the response variable.
  • Studying the coefficients of each explanatory variable individually, at the 5% risk level, we find that SMOKE, PREM and HT are detrimental to the baby's birth weight (they lead to a low birth weight); PDSM and SCOL, on the other hand, seem to act in the direction of a higher birth weight. VISIT and OLD do not seem to play a significant role in this analysis.

This first analysis can be refined by carrying out variable selection, by studying the joint role of certain variables, etc. The success of logistic regression rests largely on the range of interpretation tools it offers. With the concepts of odds, odds ratios and relative risk, computed on dichotomous variables, continuous variables or combinations of variables, the statistician can analyse causal relationships in detail and highlight the factors that really weigh on the response variable.
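As an illustration of one such tool, the odds ratio associated with a one-unit increase of a variable is the exponential of its coefficient. The Python sketch below uses the rounded coefficients that appear in the deployment calculation of this article; pairing them with the variables in the order listed is an assumption.

```python
import math

# Rounded coefficients from the deployment calculation (pairing by order is assumed)
coefficients = {"SMOKE": 0.853, "PREM": 0.691, "HT": 1.744, "SCOL": -0.660}

# Odds ratio for a one-unit increase of each variable, the others being held fixed
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
```

An odds ratio above 1 (positive coefficient) means the odds of a low birth weight increase with the variable; below 1 (negative coefficient) they decrease.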

Deployment

To classify a new individual $\omega$, we apply Bayes' rule:

$Y(\omega) = 1$ if $P(Y(\omega)=1/X(\omega)) > P(Y(\omega)=0/X(\omega))$

which is equivalent to

$Y(\omega) = 1$ if $P(Y(\omega)=1/X(\omega)) > 0.5$

In terms of the LOGIT function, this procedure amounts to the following assignment rule:

$Y(\omega) = 1$ if $\hat b_0 + \hat b_1 \times X_1(\omega) + \ldots + \hat b_J \times X_J(\omega) > 0$

Let us take the following observation $X(\omega)$ = (SMOKE = 1 "yes"; PREM = 1 "one premature birth in the mother's history"; HT = 0 "no"; VISIT = 0 "no visit to the doctor during the first trimester of pregnancy"; OLD = 28; PDSM = 54.55; SCOL = 2 "between 12 and 15 years").

Applying the equation above, we find $2.893 + 0.853 \times 1 + 0.691 \times 1 + 1.744 \times 0 + 0.030 \times 0 - 0.028 \times 28 - 0.038 \times 54.55 - 0.660 \times 2 = 0.28125$ (the rounded coefficients shown here give about 0.26; the difference presumably comes from rounding). Since the value is positive, the model predicts a low-birth-weight baby for this person.

This prediction is correct: the observation is no. 131 of our file, and it did indeed result in the birth of a low-birth-weight child.
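The assignment rule applied to this observation can be reproduced with a short Python sketch. It uses the rounded coefficients printed in the text, so the linear score comes out at about 0.26 rather than 0.28125; the dictionary layout is an illustrative choice.

```python
import math

# Rounded coefficients as printed in the text ("const" is the constant b_0)
coef = {"const": 2.893, "SMOKE": 0.853, "PREM": 0.691, "HT": 1.744,
        "VISIT": 0.030, "OLD": -0.028, "PDSM": -0.038, "SCOL": -0.660}

# The observation described in the text
obs = {"SMOKE": 1, "PREM": 1, "HT": 0, "VISIT": 0, "OLD": 28, "PDSM": 54.55, "SCOL": 2}

score = coef["const"] + sum(coef[name] * value for name, value in obs.items())
probability = 1.0 / (1.0 + math.exp(-score))  # p(1/X) implied by the score
predicted = 1 if score > 0 else 0             # equivalent to probability > 0.5
```

The score is positive, hence the probability exceeds 0.5 and the observation is assigned to the group $Y = 1$ (low birth weight).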

Correction

The assignment rule above is valid if the sample is drawn at random from the population. This is not always the case. In many fields, we first fix the sizes of the classes $Y=1$ and $Y=0$, then collect the data in each group. This is called retrospective sampling. A correction is then necessary. While the coefficients associated with the variables of the logit function are not modified, the constant must be corrected to take into account the class sizes ($n_1$ and $n_0$) and the true prior probabilities $p(1)$ and $p(0)$ (see the references below).
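One common form of this correction follows from the relation $b_0 = \ln \frac{p(1)}{p(0)} + a_0$: the constant estimated on the sample contains $\ln \frac{n_1}{n_0}$ in place of the true log prior odds, so the former is subtracted and the latter added. The Python sketch below assumes this form; the numerical values and the function name are hypothetical.

```python
import math

def corrected_constant(b0_hat, n1, n0, p1):
    """Adjust the constant estimated on a retrospective sample.

    b0_hat: constant estimated on the sample
    n1, n0: class sizes fixed in the sample
    p1: true prior probability p(1) in the population
    """
    p0 = 1.0 - p1
    return b0_hat - math.log(n1 / n0) + math.log(p1 / p0)

# Hypothetical case: balanced sample (500 / 500), true prior p(1) = 5%
b0_corrected = corrected_constant(b0_hat=0.4, n1=500, n0=500, p1=0.05)
```

When the sample proportions already match the true priors, the two logarithms cancel and the constant is unchanged.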

Variants

Logistic regression applies directly when the explanatory variables are continuous or dichotomous. When they are categorical, they must be recoded. The simplest scheme is binary coding. Take the example of a variable habitat with three categories {city, periphery, others}. We then create two binary variables: "habitat_ville" (city) and "habitat_periphery". The last category is deduced from the other two: when both variables take the value 0, the observation corresponds to "habitat = others".

Lastly, logistic regression can be used to predict the values of a categorical variable with K (K > 2) categories. This is called polytomous logistic regression. The procedure relies on the choice of a reference group; it then produces (K-1) linear combinations for the prediction. The interpretation of the coefficients is less straightforward in this case.
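The binary coding can be sketched in Python as follows. The function name and the generated variable names are illustrative; the reference category is the one that receives no indicator variable.

```python
def binary_coding(value, categories, reference, prefix="habitat"):
    """Return one 0/1 indicator per category except the reference category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories if c != reference}

categories = ["city", "periphery", "others"]
coded_city = binary_coding("city", categories, reference="others")
coded_others = binary_coding("others", categories, reference="others")
```

An observation in the reference category has all its indicators equal to 0.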
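A minimal Python sketch of the prediction step with a reference group follows. It assumes the usual formulation in which the linear combination of the reference group is fixed at 0; the coefficients and the observation are hypothetical.

```python
import math

def polytomous_proba(B, x):
    """Probabilities of the K categories; the reference category comes last.

    B: K-1 coefficient vectors [b_0, b_1, ..., b_J], one per non-reference category
    x: values [x_1, ..., x_J] of the explanatory variables
    """
    scores = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in B]
    scores.append(0.0)                    # reference category: linear combination fixed at 0
    m = max(scores)                       # subtract the maximum for numerical stability
    expo = [math.exp(s - m) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical model with K = 3 categories and J = 2 explanatory variables
B = [[0.5, 1.0, -1.0],
     [-0.2, 0.3, 0.8]]
probs = polytomous_proba(B, [1.0, 2.0])
predicted = max(range(len(probs)), key=lambda k: probs[k])  # index of the most probable category
```

Each of the K-1 linear combinations is the log of the ratio between the probability of a category and that of the reference group, which is why the coefficients are read relative to that group.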

References

  • Bardos M., Analyse discriminante - Application au risque et scoring financier, Dunod, 2001 (chapter 3).
  • Nakache J.P., Confais J., Statistique explicative appliquée, Technip, 2003 (Part 2).
  • Hosmer D.W., Lemeshow S., Applied logistic regression, Wiley Series in Probability and Mathematical Statistics, 2000.
  • Kleinbaum D.G., Logistic regression. A self-learning text, Springer-Verlag, 1994.
  • Kleinbaum D.G., Kupper L.L., Muller E.M., Applied regression analysis and other multivariate methods, PWS-KENT Publishing Company, Boston, 1988.
  • Bouyer J., Hémon D., Cordier S., Derriennic F., Stücker I., Stengel B., Clavel J., Épidémiologie - Principes et méthodes quantitatives, Les Éditions INSERM, 1993.
  • Bernard P.-M., Analyse des tableaux de contingence en épidémiologie, Presses de l'Université du Québec, 2004.

Software

  • TANAGRA, free software for teaching and research.
