Correlation (mathematics)

In probability and statistics, studying the correlation between two or more random or statistical variables means studying the strength of the relationship that may exist between them. The relationship sought is an affine one. In the case of two variables, this is linear regression.

A measure of this correlation is obtained by computing the linear correlation coefficient. This coefficient is equal to the ratio of the covariance of the two variables to the product of their standard deviations, which must be nonzero. The correlation coefficient lies between -1 and 1.

Correlation line

See also: Linear regression

Calculating the correlation coefficient between two variables amounts to summarizing the relationship between them with a straight line. This is called a linear fit.

How are the characteristics of this line calculated? By making the error committed when representing the relationship between the variables by a line as small as possible. The formal criterion most often used, though not the only possible one, is to minimize the sum of the squared errors. This is called fitting by the method of ordinary least squares. The line resulting from this fit is called the regression line. The better this line represents the relationship between the variables overall, the closer the associated linear correlation coefficient is to 1 in absolute value. The two notions are formally equivalent.
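As an illustration of the least squares criterion, here is a minimal sketch in Python. The function names `fit_line` and `sum_squared_errors` and the data values are hypothetical and chosen only for this example. It fits the line, checks that shifting the slope increases the sum of squared errors, and compares the squared correlation coefficient with the share of variance explained by the line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Also returns the squared linear correlation coefficient of xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of the squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = fit_line(xs, ys)
best_error = sum_squared_errors(xs, ys, slope, intercept)
shifted_error = sum_squared_errors(xs, ys, slope + 0.1, intercept)

mean_y = sum(ys) / len(ys)
total_variation = sum((y - mean_y) ** 2 for y in ys)
explained_share = 1 - best_error / total_variation

print(slope, intercept)
print(best_error < shifted_error)
print(r_squared, explained_share)
```

In this sketch the last two printed values agree up to rounding, which is one way to read the equivalence between the quality of the fit and the correlation coefficient.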

Correlation coefficient

Formula

r_p = \frac{\sigma_{xy}}{\sigma_x \sigma_y}

As an example, consider computing the correlation coefficient between two series of the same length (the typical case being a regression). Suppose the two series take the values X (x_1, \ldots, x_N) and Y (y_1, \ldots, y_N). The correlation coefficient linking the two series is then given by the following formula:

r_p = \dfrac{\displaystyle \sum_{i=1}^N (x_i - \bar x) \cdot (y_i - \bar y)}{\sqrt{\displaystyle \sum_{i=1}^N (x_i - \bar x)^2} \cdot \sqrt{\displaystyle \sum_{i=1}^N (y_i - \bar y)^2}}

If r_p equals 0, the two series are not linearly correlated. The closer r_p is to 1 in absolute value, the more strongly the two series are correlated.

with:

\sigma_{xy} = \frac{1}{N} \sum_{i=1}^N (x_i - \bar x) \cdot (y_i - \bar y)

where \sigma_x = \sqrt{\dfrac{1}{N} \displaystyle \sum_{i=1}^N (x_i - \bar x)^2} is the standard deviation of X

and \sigma_y = \sqrt{\dfrac{1}{N} \displaystyle \sum_{i=1}^N (y_i - \bar y)^2} is the standard deviation of Y

\bar x = \dfrac{1}{N} \displaystyle \sum_{i=1}^N x_i is the average of X and \bar y = \dfrac{1}{N} \displaystyle \sum_{i=1}^N y_i is the average of Y

Average:

Let x_i be the value of the variable for individual i.
\sum_{i=1}^N x_i is the sum of the N values, where N is the number of individuals.

\bar x = \dfrac{x_1 + x_2 + \cdots + x_N}{N} = \dfrac{1}{N} \displaystyle \sum_{i=1}^N x_i
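The formula for r_p can be turned into a short Python sketch. The function name `pearson_r` and the sample series are assumptions made for illustration; the computation uses the ratio of the sum of products of deviations to the product of the square roots of the sums of squared deviations, in which the 1/N factors cancel.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of two series of the same length."""
    if len(xs) != len(ys):
        raise ValueError("the two series must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (
        math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        * math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    )
    if denominator == 0:
        # A constant series has zero standard deviation: r_p is undefined.
        raise ValueError("the coefficient is undefined for a constant series")
    return numerator / denominator


# Hypothetical series: ys rises with xs, zs falls with xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
zs = [10, 8, 6, 4, 2]

r_up = pearson_r(xs, ys)
r_down = pearson_r(xs, zs)
print(r_up, r_down)
```

Because ys and zs are exact affine functions of xs, the two results are 1 and -1 up to floating-point rounding.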

Interpretation

The coefficient equals 1 if one of the variables is an increasing affine function of the other, and -1 if that affine function is decreasing. Intermediate values indicate the degree of linear dependence between the two variables. The closer the coefficient is to the extreme values -1 and 1, the stronger the correlation between the variables; the two variables are then simply said to be "strongly correlated". A correlation equal to 0 means that there is no linear relationship between the variables; they are said to be uncorrelated.

The correlation coefficient is not sensitive to the units of the variables. For example, the linear correlation coefficient between the age and the weight of an individual is the same whether age is measured in weeks, months or years.

On the other hand, the correlation coefficient is very sensitive to the presence of outliers or extreme values in the data set (values very far from most of the others, which may be regarded as exceptions).
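The insensitivity to units can be checked with a small Python sketch. The ages and weights below are hypothetical values made up for the illustration, and `pearson_r` is an assumed helper name.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of two series of the same length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (
        math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        * math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    )
    return numerator / denominator


# Hypothetical ages (years) and weights (kilograms).
age_years = [2, 4, 6, 8, 10, 12]
weight_kg = [12.0, 16.5, 21.0, 26.0, 32.5, 40.0]

# The same data expressed in other units.
age_months = [12 * a for a in age_years]
weight_g = [1000 * w for w in weight_kg]

r_years_kg = pearson_r(age_years, weight_kg)
r_months_g = pearson_r(age_months, weight_g)
print(r_years_kg, r_months_g)
```

Multiplying a variable by a positive constant scales the covariance and the standard deviation by the same factor, so the ratio is unchanged up to rounding.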
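The effect of a single extreme value can be sketched as follows. The data are hypothetical: ten points lying close to a line, then the same points with one aberrant observation appended.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of two series of the same length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (
        math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        * math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    )
    return numerator / denominator


# Ten hypothetical points that lie close to the line y = x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10.1]
r_clean = pearson_r(xs, ys)

# The same points plus a single aberrant observation.
xs_outlier = xs + [11]
ys_outlier = ys + [-20.0]
r_outlier = pearson_r(xs_outlier, ys_outlier)

print(r_clean, r_outlier)
```

In this example one aberrant point out of eleven is enough to turn a coefficient close to 1 into a negative one.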

Dependence

Caution: it is always possible to compute a correlation coefficient (except in very particular cases, such as when one of the variables is constant), but such a coefficient does not always reflect the relationship that actually exists between the variables studied. It assumes that one is looking for a linear relationship between the variables. It is therefore unsuited to assessing relationships that are neither linear nor linearizable. It also loses its value when the data are very heterogeneous, since it represents an average relationship and the mean is not always meaningful, in particular when the distribution of the data is multimodal.

If the two variables are completely independent, their correlation is equal to 0. The converse is false, however, because the correlation coefficient only detects linear dependence. Other phenomena may, for example, be related exponentially or through a power law (see statistical series with two variables in elementary mathematics).

Suppose that the random variable X is uniformly distributed on an interval symmetric about 0, such as [-1, 1], and that Y = X^2; then Y is completely determined by X, so X and Y are not independent, yet their correlation is 0.

These considerations are illustrated by examples in the field of statistics.
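The zero correlation in this example can be verified with a short calculation. It assumes that the interval is [-1, 1], which is an assumption made here for concreteness; any interval symmetric about 0 gives the same result.

```latex
\operatorname{Cov}(X, Y) = E[XY] - E[X] \, E[Y] = E[X^3] - E[X] \, E[X^2]

E[X] = \frac{1}{2} \int_{-1}^{1} x \, dx = 0, \qquad E[X^3] = \frac{1}{2} \int_{-1}^{1} x^3 \, dx = 0

\operatorname{Cov}(X, Y) = 0 - 0 \cdot E[X^2] = 0
```

Since the covariance is 0 and neither standard deviation is 0, the correlation coefficient is 0 even though Y is a function of X.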

Cause-and-effect relationship

A common error is to believe that a high correlation coefficient implies a causal relationship between the two phenomena measured. In fact, the two phenomena may both be correlated with a common source: a third, unmeasured variable on which the other two depend. The number of sunstroke cases observed in a seaside resort, for example, may be strongly correlated with the number of sunglasses sold, but neither phenomenon is likely to be the cause of the other.
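The role of a common source can be illustrated with a simulation. Everything in this sketch is hypothetical: the third variable is taken to be the daily temperature, and the coefficients and noise levels are arbitrary. Sunstroke cases and sunglasses sales are each generated from temperature alone, never from one another.

```python
import math
import random


def pearson_r(xs, ys):
    """Linear correlation coefficient of two series of the same length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (
        math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        * math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    )
    return numerator / denominator


def residuals(ys, ts):
    """What remains of ys after removing its least squares line on ts."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (
        sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
        / sum((t - mean_t) ** 2 for t in ts)
    )
    intercept = mean_y - slope * mean_t
    return [y - (slope * t + intercept) for t, y in zip(ts, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical common source: daily temperature in degrees Celsius.
temperature = [rng.uniform(15, 35) for _ in range(2000)]
# Both phenomena depend on temperature only, plus independent noise.
sunstrokes = [0.5 * t + rng.gauss(0, 1) for t in temperature]
sunglasses = [3.0 * t + rng.gauss(0, 5) for t in temperature]

r_raw = pearson_r(sunstrokes, sunglasses)
r_adjusted = pearson_r(
    residuals(sunstrokes, temperature),
    residuals(sunglasses, temperature),
)
print(r_raw, r_adjusted)
```

The raw coefficient is high although neither series influences the other; once the linear effect of temperature is removed from both, the remaining correlation is close to 0.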

Precautions to be taken

More generally, the study of the relationship between variables, whatever they may be, should be accompanied by descriptive graphs, exhaustive or not, of the available data, so as not to be misled by the purely technical limits of the calculations used. However, once one is interested in the relationships among many variables, graphical representations may no longer be possible, or may be at best illegible. Calculations such as those discussed so far, limited by definition, then help simplify the interpretation of the relationships between the variables, and that is their main value. It remains to check, before any interpretation, that the main assumptions required to read them correctly hold.
