Descriptive statistics

Descriptive statistics is the branch of statistics that gathers the many techniques used to describe a relatively large set of data.

Statistical description

The objective of descriptive statistics is to describe, i.e. to summarize or represent by means of statistics, the available data when they are numerous.

Available data

Any description of a phenomenon requires observing or knowing certain things about that phenomenon.

  • the available observations always consist of sets of synchronous observations. For example: a temperature, a pressure and a density measurement taken at a given moment in a specific tank. These three synchronous variables can be observed several times (at several dates) and in several places (in several tanks).
  • the available knowledge, for its part, consists of formulas that connect certain variables. For example, the ideal gas law PV = nRT.

Description

It is rather complicated to define the best possible description of a phenomenon. Within the framework of statistics, the aim is to provide all the information available about the phenomenon in as few figures and words as possible.

Typically, the ideal gas law is a very good description of the phenomenon consisting of a gas in a state of equilibrium, of which one observes only the pressure, the temperature and the volume. The value of the constant R can then be seen as a statistic associated with this description.

The question of visual description also arises, but we will set it aside for now. The article Visualization of the data addresses it more directly.

Statistical point of view

The statistical point of view on the description of a phenomenon comes from considering that the available observations are different manifestations of the same abstract phenomenon. To stay with the example of temperature, pressure and density measured at several moments, one considers that each time these three measurements are taken, the same phenomenon is observed. The measurements will not be exactly the same; it is the distribution of these measurements that we will describe statistically.

Examples

Physical quantities

If one measures from time to time the pressure, the temperature and the density of a gas in a tank, one obtains a collection of triplets of data, indexed by the time of measurement.

Behavioral or biological quantities

In the medical field, one can for example measure the weight of several people before and after taking a drug. One then obtains a collection of pairs of data (weight before and after), indexed by the name of the person.

In sociology or marketing, one can measure the number of books read per year by many people whose age and level of education are also known. Here too one obtains a collection of triplets of data, indexed by the name of the reader.

Formalization of the practical cases

The various measured quantities are called variables.

The statistical study requires assuming that there exists a more or less hidden abstract phenomenon that involves these variables (and possibly others).

Each value of the index (which can be a date, or a number identifying an individual) then identifies a partial snapshot of the phenomenon. The values of the variables for a given index are called an observation, or a realization, of the phenomenon.

From a formal point of view, one posits that the abstract phenomenon can comprise deterministic elements as well as random (also called stochastic) elements. All the observed variables are then juxtaposed in the form of a data vector. There is then only one variable (but it is multivariate).

The observations are then so many realizations (in the sense of mathematical statistics) of this multivariate random variable.

Study of a single variable

Description of a univariate phenomenon

Let us start with the simplest situation: the observation of a single variable (for example the pressure in a tank, or the number of books read per year by a person). As we saw above, we assume that there exists a phenomenon of which this variable is a part, and that this phenomenon may be partly random. This random part implies that the observed variable stems from an abstract variable that is partly subject to an unknown hazard.

The observations at our disposal are then realizations of this abstract random variable.

The objective of descriptive statistics within this framework is to summarize this collection of values as well as possible, relying where appropriate on our assumption (the existence of an abstract random law behind it all).

Exhaustive description

A first remark is that the best possible description of a phenomenon based on a collection of observations is the collection itself! Indeed, why complicate matters by calculating many indicators when everything is already there?

First, it should be noted that this remark is far from stupid, and from a certain point of view this philosophy underlies nonparametric statistics.

But second, it is clear that summarizing these observations is worthwhile. The important question then becomes: how to summarize them without destroying the information they contain?

Simple example

Suppose our observations are the success or failure of 23 sportsmen at a high-jump trial. They form a series of "success" (S) and "failure" (E, for the French "échec") values indexed by the name of the sportsman. Here are the data:

S, S, E, E, E, S, E, S, S, S, E, E, S, E, S, E, S, S, S, S, E, E, S

Without much thought, and by applying statistical criteria, we could decide to describe this phenomenon as follows:

By awarding one point to each of the 23 sportsmen when he succeeds in his jump, and none when he fails, the mean number of points gained is 0.5652 and the standard deviation of the points gained is 0.5069.

We will undoubtedly prefer this one:

23 sportsmen jumped, 13 of them succeeded.

This description is simple, clear and short (less than 50 characters).
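As a quick check, the figures 0.5652 and 0.5069 quoted above can be recomputed from the raw series. The sketch below is only illustrative: the variable names are ours, and it assumes that the standard deviation is the corrected one (denominator N - 1), since that is the convention that reproduces 0.5069. Note that 0.5652 is the arithmetic mean of the points, 13/23.

```python
import statistics

# Raw outcomes from the text: S = success, E = failure ("échec").
outcomes = "S S E E E S E S S S E E S E S E S S S S E E S".split()

# One point for a success, none for a failure.
points = [1 if outcome == "S" else 0 for outcome in outcomes]

n_jumpers = len(points)
n_success = sum(points)
mean_points = statistics.mean(points)
# statistics.stdev uses the corrected (N - 1) denominator (assumed convention).
std_points = statistics.stdev(points)

print(f"{n_jumpers} sportsmen jumped, {n_success} of them succeeded.")
print(f"mean = {mean_points:.4f}, standard deviation = {std_points:.4f}")
```

Both descriptions come from the same 23 values; the short sentence simply keeps the two counts from which everything else can be derived.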

It is also entirely possible to give a description that destroys information, for example this one:

By awarding one point to each sportsman when he succeeds in his jump, and none when he fails, the mean number of points gained is 0.5652.

Indeed, it lacks at least the number of jumpers, which is an important descriptive element.

Of course, if one seeks to describe a particular phenomenon, such as "if I had bet on one of the 23 jumpers, what chance would I have had of winning?", the answer would have been different:

57%

much shorter, and destroying no information with regard to the question. It is then no longer a matter of describing the realizations of the phenomenon without any particular point of view, but from a quite precise angle. One is actually describing another phenomenon (that of the bets).

It is thus very important to answer the question actually asked, and not to apply ready-made formulas without thinking.

Finally, let us consider another question: "If I were to bet at a forthcoming jump trial, what would be my chance of winning?"

We could answer 57%, as for the preceding question, but after all we observed only 23 jumpers; is this sufficient to draw a conclusion about the performances of other jumpers?

In order to give an answer all the same, let us state the main assumption that we will use:

Assumption: the nature of the jumpers' performances will be the same as that observed.

This means that if this competition was national, the next one will be too: one will not use observations from a national-level phenomenon to describe the same phenomenon at Olympic level, for example.

And even in this framework, if for example we had observed only 2 jumpers, both of whom had succeeded, would that mean that all national-level jumpers always succeed (i.e. that I have a 100% chance of winning)? Of course not.

We must then resort to the concept of confidence interval: the goal is to take into account the size of our sample of sportsmen, combined with certain probabilistic assumptions.

In fact, mathematical statistics tells us that an estimator of a proportion calculated from N observations approximately follows a normal law of variance p(1-p)/N around the theoretical proportion p. In our case: N = 23 and p = 13/23, i.e. about 0.57. This tells us that under our assumption, there is a probability of 95% that our chance of winning is between 0.57 - 1.96 \sqrt{0.57 \times 0.43 / 23} and 0.57 + 1.96 \sqrt{0.57 \times 0.43 / 23}. The answer is thus finally:

There is a 95% chance that the probability of winning our bet at a similar meeting lies between 36% and 77%.
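The interval above can be reproduced with a few lines of Python. This is a minimal sketch of the normal approximation described in the text; the function name and the default z = 1.96 (the 95% Gaussian quantile) are our own choices, and the proportion is taken as the unrounded 13/23.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (z = 1.96 for 95%)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = proportion_confidence_interval(13, 23)
print(f"{low:.0%} to {high:.0%}")
```

With 2 successes out of 2 jumpers the same formula returns a zero-width interval at 100%, a reminder that the normal approximation is unreliable for very small samples.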

Methodological elements

Finally, there exists a whole collection of statistics that can be used for descriptive purposes. They are criteria that quantify various characteristics of the distribution of the observations:

  • are they centered around a value?
  • are they grouped around certain values?
  • do they span wide ranges of possible values?
  • do they follow known statistical laws?
  • etc.

With no preconception about the question being asked, we can review these various descriptive indicators.

Intrinsic description of a distribution of observations

With no preconception about the question being asked, a few simple statistics make it possible to describe the distribution:

  • the average
  • the median
  • the mode
  • the maximum
  • the minimum
  • the standard deviation (and the variance)
  • the quantiles

The first two are often called position criteria, while the others fall rather into the category of dispersion criteria.

The average

See also: Average

The arithmetic mean is the sum of the values of the variable divided by the number of individuals: \bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i

The median

See also: Median (center)

The median is the central value that divides the sample into 2 groups of equal size: 50% above and 50% below. The median can differ from the mean. In France, the median wage is lower than the average wage: there are many minimum-wage earners (smicards) and few very high wages. However, the high wages pull the average upwards.

More generally, a median of an array is a value M such that there are as many values greater than or equal to M as values less than or equal to M. Examples:

  • 1 3 5 9 6 4 6: once sorted (1 3 4 5 6 6 9), the median is 5
  • 5 5 6 6 8 8: the median is (6+6)/2 = 6
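The definition can be sketched in Python as follows. The key step is the sort: the middle element of the unsorted list 1 3 5 9 6 4 6 is 9, but its median is 5. The hand-written function is only an illustration; the standard library's statistics.median does the same job.

```python
import statistics

def median(values):
    """Sort first, then take the middle value (or the mean of the two middle ones)."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_example = [1, 3, 5, 9, 6, 4, 6]    # odd number of values
even_example = [5, 5, 6, 6, 8, 8]      # even number of values

print(median(odd_example), median(even_example))
# The standard library agrees with the hand-written version.
print(statistics.median(odd_example), statistics.median(even_example))
```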

Mode

See also: Mode (statistical)

The mode corresponds to the most frequent realization.

The Variance

See also: Variance (statistics and probabilities)

The corrected empirical variance \hat{\sigma}^2, an estimator of the square of the standard deviation (i.e. of the variance), is: \hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X})^2

Note: the observed variance (a concept of descriptive statistics) is the simple arithmetic mean of the squared deviations from the observed arithmetic mean, whereas the unbiased variance (a concept of mathematical statistics, meaning that the expected value of the statistic, here the variance, equals its theoretical value) is N/(N - 1) times the observed variance. The unbiased variance is thus higher than the observed variance.
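The relation between the two variances can be checked numerically. The sketch below reuses the 0/1 points of the jump example (13 successes, 10 failures) purely as sample data; statistics.pvariance divides by N and statistics.variance divides by N - 1.

```python
import statistics

# 13 successes and 10 failures, coded 1 and 0 as in the jump example.
points = [1] * 13 + [0] * 10
n = len(points)

observed_variance = statistics.pvariance(points)   # mean of squared deviations (divides by N)
unbiased_variance = statistics.variance(points)    # corrected variance (divides by N - 1)

ratio = unbiased_variance / observed_variance
print(observed_variance, unbiased_variance)
print(ratio, n / (n - 1))   # the two numbers should coincide up to rounding
```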

Standard deviation

See also: Standard deviation

\hat\sigma_X: the square root of the variance.

  • Coefficient of variation: C.V. = \frac{\sigma}{\bar{X}}
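A minimal Python sketch of the coefficient of variation follows. The text does not say which standard deviation is meant; the corrected one (statistics.stdev) is assumed here, and the weights are hypothetical values used only for illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (undefined when the mean is zero)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: the corrected (N - 1) standard deviation is used.
    return statistics.stdev(values) / mean

weights = [62.0, 70.5, 81.2, 58.9, 77.4]   # hypothetical weights in kg
cv = coefficient_of_variation(weights)
print(f"C.V. = {cv:.3f}")
```

Being a ratio, the coefficient has no unit, which makes it convenient for comparing the dispersion of variables measured on different scales.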

Minimum and maximum

  • Range: the interval between the smallest and the greatest value. A phenomenon is said to present “strong dynamics” when the range (or dispersion) is large.

Confidence intervals

The central limit theorem, which makes the law of large numbers quantitative, guarantees that the estimated mean \bar X is at a distance smaller than d from the theoretical mean E(X) with a probability close to P\left( |Y| < \frac{d \sqrt{N}}{\hat\sigma_X} \right), where Y follows a standard Gaussian distribution. In other words (q_\alpha being the Gaussian probability associated with \alpha, i.e. q_\alpha = P(|Y| \leq \alpha), for example q_\alpha = 0.95 for \alpha = 1.96):

P\left( E(X) \in \left[ \bar X - \alpha \frac{\hat\sigma_X}{\sqrt{N}},\ \bar X + \alpha \frac{\hat\sigma_X}{\sqrt{N}} \right] \right) \approx q_\alpha

Consequently, when the sample size N increases, the width of this interval shrinks as 1/\sqrt{N}: the precision of the estimator of the mean improves only as \sqrt{N}.

When the N observations constitute the whole population rather than a sample of it, the unbiased variance should not be used, since the context is no longer one of estimation but of measurement.
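The interval for the mean can be sketched as below, under the Gaussian approximation discussed in this section. The function name, the default multiplier 1.96 (a 95% interval) and the sample values are our own illustrative choices; the corrected standard deviation is used because the data are treated as a sample.

```python
import math
import statistics

def mean_confidence_interval(values, alpha=1.96):
    """Interval mean +/- alpha * sigma / sqrt(N), Gaussian approximation."""
    n = len(values)
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)   # corrected standard deviation (sample context)
    half_width = alpha * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

sample = [1, 2, 3, 4, 5]   # hypothetical measurements
low, high = mean_confidence_interval(sample)
print(f"mean between {low:.3f} and {high:.3f}")
```

Because the half-width is proportional to 1/\sqrt{N}, halving it requires roughly four times as many observations.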

Quantiles

Quantiles generalize the concept of the median, which cuts the distribution into two equal parts. One defines in particular the quartiles, deciles and centiles (or percentiles) of the population, sorted in ascending order, which is divided into 4, 10 or 100 parts of equal size.

One thus speaks of “centile 90” to indicate the value separating the first 90% of the population from the remaining 10%. Thus, in a population of young children, a child whose height or weight is above centile 90 or below centile 10 must be given particular follow-up.
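Quartiles, deciles and centiles can be computed with the standard library function statistics.quantiles (Python 3.8 or later). The population below is hypothetical (the integers 1 to 99) and the helper needs_follow_up is our own illustration of the rule stated in the text; several interpolation conventions exist, and the library's default is used here.

```python
import statistics

# Hypothetical population: the integers 1 to 99, so that cut points are easy to read.
population = list(range(1, 100))

quartiles = statistics.quantiles(population, n=4)     # 3 cut points
deciles = statistics.quantiles(population, n=10)      # 9 cut points
centiles = statistics.quantiles(population, n=100)    # 99 cut points

centile_90 = centiles[89]   # separates the first 90% from the remaining 10%
centile_10 = centiles[9]    # separates the first 10% from the remaining 90%

def needs_follow_up(value):
    """Flag a value above centile 90 or below centile 10, as in the text."""
    return value > centile_90 or value < centile_10

print(quartiles, centile_10, centile_90)
```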

Histogram

See also: Histogram

Even if it is regarded by many as a chart, and thus belongs rather in a description of the methods of Visualization of the data, the histogram is a natural link between an exhaustive representation of the data and description by comparison with known statistical laws.

Empirical distribution

The empirical density of a variable with discrete values is simply made up of the proportion of the observations taking each value.

Returning to the example of the sportsmen, the empirical density of our population is 57% successes and 43% failures. The associated histogram is very simple (see the image on the left).
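For a discrete variable, the empirical density is just a table of proportions. A minimal sketch on the jump data (variable names are ours):

```python
from collections import Counter

# Raw outcomes of the 23 jumps: S = success, E = failure.
outcomes = "S S E E E S E S S S E E S E S E S S S S E E S".split()

counts = Counter(outcomes)
total = len(outcomes)
# Proportion of the observations taking each value.
empirical_density = {value: count / total for value, count in counts.items()}

for value, proportion in empirical_density.items():
    print(f"{value}: {proportion:.0%}")
```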

The empirical distribution function associated with a series of real-valued observations V_1, \ldots, V_N is the following function:

F^*(v) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}_{v \geq V_n}

It is an estimate of the probability that a realization of the observed phenomenon has a value lower than or equal to v.
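The empirical distribution function translates directly into code: count the observations V_n such that v >= V_n and divide by N. The sketch below uses hypothetical observations and a function name of our own.

```python
def empirical_distribution(observations):
    """Return the function F*(v) = (1/N) * number of observations V_n with v >= V_n."""
    n = len(observations)

    def f_star(v):
        return sum(1 for obs in observations if v >= obs) / n

    return f_star

# Hypothetical real-valued observations.
f_star = empirical_distribution([2.0, 3.5, 3.5, 5.0])
print(f_star(1.0), f_star(3.5), f_star(10.0))
```

The function is a staircase: it jumps by 1/N (or a multiple of it, for repeated values) at each observation and is flat in between.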

If one wanted to deduce from it the empirical density associated with the observations, F^*(v) would have to be differentiated. Given that the derivative of an indicator function (\mathbf{1}_{v \geq V_n}) is a Dirac distribution, the result would not be very usable.

Several alternatives are possible:

  • use a kernel estimator, which amounts to using the following density:
f^*(v) = \frac{1}{N} \sum_{n=1}^{N} K_r(v - V_n) where K_r is a kernel function (of mass equal to one).
  • approximate the density by a step function.
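The kernel estimator can be sketched as follows. The text does not specify the kernel; a Gaussian kernel of bandwidth r is assumed here as one common choice, and the observations are hypothetical.

```python
import math

def gaussian_kernel(u, r):
    """Gaussian kernel of bandwidth r (assumed choice); its total mass is one."""
    return math.exp(-0.5 * (u / r) ** 2) / (r * math.sqrt(2 * math.pi))

def kernel_density(v, observations, r):
    """f*(v) = (1/N) * sum over n of K_r(v - V_n)."""
    return sum(gaussian_kernel(v - obs, r) for obs in observations) / len(observations)

observations = [2.0, 3.5, 3.5, 5.0]   # hypothetical data
bandwidth = 0.5                        # hypothetical bandwidth r

# Crude Riemann sum to check that the estimated density has mass close to one.
step = 0.01
mass = sum(kernel_density(-5 + step * k, observations, bandwidth) * step for k in range(1701))
print(kernel_density(3.5, observations, bandwidth), mass)
```

The bandwidth r plays the same role as the bar width of a histogram: a small r follows the data closely, a large r smooths them.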

A histogram is the best estimate of the empirical density by a step function, i.e. the integral of the histogram must be as close as possible to F^*(v). Note that the integral of the histogram is a continuous piecewise affine function. From a certain point of view:

finding the continuous piecewise affine function that best approximates the empirical distribution function amounts to characterizing the histogram completely.

Within this framework, the number of pieces (of classes or bars) is a very important parameter. An additional criterion is needed if one wants to find its best possible value. One can take for example the Akaike criterion or the BIC (Bayesian Information Criterion); it is also possible to resort to an information or entropy criterion.

By construction, the bars of the histogram are thus not necessarily all of the same width.
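When bars have different widths, it is the area of each bar, not its height, that represents the share of observations. The sketch below only illustrates this normalization for bin edges supplied by hand; it does not attempt to select the bins with an Akaike or BIC criterion. The data, the edges and the function name are hypothetical.

```python
def histogram_density(observations, edges):
    """Bar heights for bins [edges[i], edges[i+1]); the last bin also includes its right edge.

    Each height is (share of observations in the bin) / (bin width),
    so the bars have total area one even when widths differ.
    """
    n = len(observations)
    last = len(edges) - 2
    heights = []
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        count = sum(
            1 for v in observations
            if left <= v < right or (i == last and v == right)
        )
        heights.append(count / (n * (right - left)))
    return heights

observations = [2.0, 3.5, 3.5, 5.0]   # hypothetical data
edges = [2.0, 3.0, 5.0]               # hypothetical bins of unequal width
heights = histogram_density(observations, edges)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights, area)
```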
