Descriptive statistics
Descriptive statistics is the branch of statistics that gathers the many techniques used to describe a relatively large set of data.
Statistical description
The objective of descriptive statistics is to describe, i.e. to summarize or represent, by means of statistics, the available data when they are numerous.
Available data
Any description of a phenomenon requires observing or knowing certain things about that phenomenon.
- The available observations always consist of sets of synchronous observations, for example a temperature, a pressure and a density measurement taken at a given moment in a specific tank. These three synchronous variables can be observed several times (at several dates) and in several places (in several tanks).
- The available knowledge consists of formulas that connect certain variables, for example the ideal gas law.
Description
It is rather complicated to define the best possible description of a phenomenon. Within the framework of statistics, the aim is to provide all the available information about the phenomenon in as few figures and words as possible.
Typically, the ideal gas law is a very good description of the phenomenon consisting of a gas in a state of equilibrium, of which one observes only the pressure, the temperature and the volume. The value of the constant can then be seen as a statistic associated with this description.
The question of visual description also arises, but we will set it aside for now. The article Data visualization addresses it more directly.
Statistical point of view
The statistical point of view on the description of a phenomenon comes from considering that the available observations are different manifestations of the same abstract phenomenon. To stay with the example of the temperature, the pressure and the density measured at several moments, one considers that each time these three measurements are taken, the same phenomenon is observed. The measurements will not be exactly the same; it is the distribution of these measurements that we will describe statistically.
Examples
Physical quantities
If one measures from time to time the pressure, the temperature and the density of a gas in a tank, one obtains a collection of triplets of data, indexed by the time of measurement.
Behavioral or biological quantities
In the medical field, one can for example measure the weight of several people before and after they take a drug. One then obtains a collection of pairs of data (weight before and after) indexed by the name of the person. In sociology or marketing, one can measure the number of books read per year by many people whose age and level of education are also known. Here too one obtains a collection of triplets of data, indexed by the name of the reader.
Formalization of the practical cases
The various measured quantities are called variables. The statistical study requires assuming that there exists a more or less hidden abstract phenomenon that involves these variables (and possibly others).
Each value of the index (which can be a date, or a number identifying an individual) then identifies a partial snapshot of the phenomenon. The values of the variables for a given index are called an observation or a realization of the phenomenon.
From a formal point of view, one posits that the abstract phenomenon can comprise deterministic elements as well as random (also called stochastic) elements. All the observed variables are then juxtaposed in the form of a data vector. There is then only one variable left (which is, however, multivariate).
The observations are then so many realizations (in the sense of mathematical statistics) of this multivariate random variable.
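As an illustration, this formalization can be sketched in Python. It is only a minimal sketch: the class name Observation, the field names, the units and all the numerical values are invented for the example and do not come from real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One realization of the phenomenon: all variables for one index value."""
    pressure: float      # hypothetical unit: Pa
    temperature: float   # hypothetical unit: K
    density: float       # hypothetical unit: kg/m^3


# The index (here a measurement time) identifies each partial snapshot.
# All values below are invented for illustration.
observations = {
    "t0": Observation(101325.0, 293.15, 1.204),
    "t1": Observation(101400.0, 293.40, 1.203),
    "t2": Observation(101290.0, 292.90, 1.205),
}

# Juxtaposing the variables gives one multivariate data vector per index.
vectors = [(o.pressure, o.temperature, o.density) for o in observations.values()]
```

Each tuple in vectors plays the role of one realization of the single multivariate variable.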
Study of a single variable
Description of a univariate phenomenon
Let us start with the simplest situation: the observation of a single variable (for example the pressure in a tank, or the number of books read per year by a person). As we saw above, we assume that there exists a phenomenon of which this variable is a part, and that this phenomenon may be partly random. This random part implies that the observed variable stems from an abstract variable that is partly subject to an unknown random component.
The observations at our disposal are then realizations of this abstract random variable.
The objective of descriptive statistics within this framework is to summarize this collection of values as well as possible, relying if need be on our assumption (the existence of an abstract random law behind it all).
Exhaustive description
A first remark is that the best possible description of a phenomenon from a collection of observations is the collection itself! Indeed, why go to the trouble of computing many indicators when everything is already there?
First, it should be noted that this remark is far from stupid: from a certain point of view, this is the philosophy behind nonparametric statistics.
Second, it is nevertheless clear that summarizing these observations is worthwhile. The important question then becomes: how can they be summarized without destroying the information they contain?
Simple example
Suppose our observations are the success or failure of 23 athletes at a high-jump trial. The data are a series of "success" (S) or "failure" (E, from the French "échec") values indexed by the name of the athlete. Here are the data: S, S, E, E, E, S, E, S, S, S, E, E, S, E, S, E, S, S, S, S, E, E, S. Without thinking, and by mechanically applying statistical criteria, we could decide to describe this phenomenon as follows:
- Awarding one point to each of the 23 athletes who clears the jump, and none to those who miss it, the mean number of points gained is 0.5652 and the standard deviation of the points gained is 0.5069.
We would undoubtedly prefer this one:
- 23 athletes jumped, 13 of them succeeded.
This description is simple, clear and short (fewer than 50 characters).
It is also entirely possible to produce a description that destroys information, for example this one:
- Awarding one point to each athlete who clears the jump, and none to those who miss it, the mean number of points gained is 0.5652.
Indeed, it lacks at least the number of jumpers, which is an important descriptive element.
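The figures quoted in these descriptions can be checked with a few lines of Python. The sketch below assumes the encoding of one point per success and zero per failure, and uses the sample standard deviation (division by n - 1), which is the convention that reproduces the value 0.5069; the variable names are ours.

```python
from statistics import mean, stdev

# The 23 results listed above: S = success, E = failure.
results = "S S E E E S E S S S E E S E S E S S S S E E S".split()

# One point for a success, none for a failure.
points = [1 if r == "S" else 0 for r in results]

n_jumpers = len(points)      # number of athletes
n_success = sum(points)      # number of successful jumps
mean_points = mean(points)   # proportion of successes
sd_points = stdev(points)    # sample standard deviation (n - 1 denominator)

print(f"{n_jumpers} athletes jumped, {n_success} of them succeeded.")
print(f"mean = {mean_points:.4f}, standard deviation = {sd_points:.4f}")
```

The pair (number of jumpers, number of successes) is enough to recover the mean and the standard deviation, whereas the mean alone does not give back the number of jumpers.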
Of course, if one seeks to describe a particular phenomenon, such as "if I had bet on one of the 23 jumpers, what were my chances of winning?", the answer would have been different:
- 57%
This is much shorter, and destroys no information with regard to the question. It is then no longer a matter of describing the realizations of the phenomenon without any particular point of view, but from a very precise angle. One is actually describing another phenomenon (that of the bets).
It is thus very important to answer the question actually asked, and not to apply ready-made formulas without thinking.
Finally, let us consider another question: "If I were to bet at a forthcoming jump trial, what would be my chance of winning?".
We could answer 57%, as for the preceding question, but after all we have observed only 23 jumpers; is this sufficient to draw a conclusion about the performances of other jumpers?
To provide an answer nonetheless, let us state the main assumption that we will use:
- Assumption: the nature of the performances of the jumpers will be the same as that observed.
This means that if this competition was national, the next one will be too: one will not apply observations from a national-level phenomenon to the same phenomenon at Olympic level, for example.
And even within this framework, if for example we had observed only 2 jumpers, both of whom had succeeded, would that mean that all national-level jumpers always succeed (i.e. that I have a 100% chance of winning)? Of course not.
We must then resort to the concept of confidence interval: the goal is to account for the size of our sample of athletes, combined with certain probabilistic assumptions.
In fact, mathematical statistics tells us that an estimator $\hat{p}$ of a proportion computed from $n$ observations approximately follows a normal law of variance $p(1-p)/n$ around the theoretical proportion $p$. In our case: $n = 23$ and $\hat{p} = 13/23 \approx 0.57$.
This tells us that, under our assumption, there is a probability of 95% that our chance of winning is between $\hat{p} - 1.96\sqrt{\hat{p}(1-\hat{p})/n} \approx 0.36$ and $\hat{p} + 1.96\sqrt{\hat{p}(1-\hat{p})/n} \approx 0.77$. The answer is thus finally:
There is a 95% chance that the probability of winning our bet at a similar meeting lies between 36% and 77%.
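The bounds of 36% and 77% can be reproduced with a short Python sketch. It assumes the usual normal approximation for an estimated proportion (variance p(1 - p)/n, with 1.96 as the two-sided 95% quantile of the normal law); the function name wald_interval is ours, and other interval constructions exist that behave better for small samples.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z = 1.96 corresponds to a two-sided 95% level.
    """
    p_hat = successes / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# 13 successes observed among 23 jumpers.
low, high = wald_interval(13, 23)
print(f"95% interval: {low:.0%} to {high:.0%}")
```

With only 2 jumpers who both succeed, the same formula gives an interval of zero width at 100%, which shows the limits of this approximation for very small samples.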
Methodological elements
Finally, there exists a whole collection of statistics that can be used for descriptive purposes. They are criteria that quantify various characteristics of the distribution of the observations:
- are they centered around a value?
- are they grouped around certain values?
- do they span wide ranges of possible values?
- do they follow known statistical laws?
- etc.
With no a priori regarding the question being asked, we can review these various descriptive indicators.
Intrinsic description of a distribution of observations
Without any a priori regarding the question being asked, some simple statistics make it possible to describe a distribution:
- the mean
- the median
- the mode
- the maximum
- the minimum
- the standard deviation (and the variance)
- the quantiles
The first three are often called measures of position, while the others fall rather into the category of measures of dispersion.
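A minimal Python sketch of these indicators, using only the standard statistics module. The function name describe and the sample values are ours; statistics.quantiles is called with its default "exclusive" method, which is one convention among several for computing quantiles.

```python
import statistics


def describe(values):
    """Return the simple descriptive indicators listed above as a dict."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),          # most frequent value
        "max": max(values),
        "min": min(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "quartiles": statistics.quantiles(values, n=4),  # three cut points
    }


# Illustrative values only.
summary = describe([1, 3, 5, 9, 6, 4, 6])
for name, value in summary.items():
    print(name, value)
```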
The mean
See also: Mean
The arithmetic mean is the sum of the values of the variable divided by the number of individuals: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $x_1, \dots, x_n$ are the $n$ observed values.
The median
See also: Median (center)
The median is the central value that divides the sample into 2 groups of equal size: 50% above and 50% below. The median can differ from the mean. In France, the median wage is lower than the mean wage: there are many minimum-wage earners ("smicards") and few very high wages. However, the high wages pull the mean upwards.
In general, a median of an array is a value M such that there are as many values greater than or equal to M as values less than or equal to M. Examples: for 1 3 5 9 6 4 6, the values must first be sorted (1 3 4 5 6 6 9) and the median is 5; for 5 5 6 6 8 8, the median is (6+6)/2 = 6.
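The definition can be sketched in Python as follows. The function name median_of is ours, and the list of wages is invented purely to illustrate how a few large values pull the mean above the median; it does not reflect actual French wage data. Note that the values are sorted before the central one is taken.

```python
def median_of(values):
    """Median of a list: middle value once the data are sorted.

    For an even number of values, the two central values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([1, 3, 5, 9, 6, 4, 6])   # sorted: 1 3 4 5 6 6 9
even_case = median_of([5, 5, 6, 6, 8, 8])     # (6 + 6) / 2

# Hypothetical monthly wages: many low values, one very large one.
wages = [1400, 1400, 1400, 1500, 1800, 2500, 9000]
median_wage = median_of(wages)
mean_wage = sum(wages) / len(wages)
print(median_wage, round(mean_wage, 2))
```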
Mode
See also: Mode (statistics)
The mode is the most frequently observed value.
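A minimal Python sketch, applied to the high-jump results used earlier. The function name modes is ours; it returns a list because several values can be tied for the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that reaches the highest observed frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# High-jump results: S = success, E = failure.
results = "S S E E E S E S S S E E S E S E S S S S E E S".split()
jump_modes = modes(results)           # success is the most frequent outcome
tied_modes = modes([5, 5, 6, 6, 8, 8])  # three values tied
```

The mode also applies to non-numerical data such as S and E, unlike the mean or the median.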
The variance
See also: Variance (statistics and probabilities)
The corrected empirical variance, i.e. the square of the standard deviation, is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the arithmetic mean of the $n$ observed values $x_i$.
Note: the observed variance (a concept of descriptive statistics) is the simple arithmetic mean of the squared deviations from the observed arithmetic mean, $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, but the unbiased variance (a concept of mathematical statistics, meaning that the expected value of the statistic, here the variance, equals its theoretical value) is $\frac{n}{n-1}$ times the observed variance. The unbiased variance is thus higher than the observed variance.
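The relation between the two variances can be checked numerically. In the sketch below, the function names observed_variance and unbiased_variance and the sample values are ours; the results are compared with pvariance and variance from the standard statistics module, which implement the same two conventions.

```python
import math
import statistics


def observed_variance(values):
    """Mean of the squared deviations from the observed mean (divides by n)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


def unbiased_variance(values):
    """Observed variance rescaled by n / (n - 1)."""
    n = len(values)
    return observed_variance(values) * n / (n - 1)


data = [1, 3, 5, 9, 6, 4, 6]  # illustrative values only
v_obs = observed_variance(data)
v_unb = unbiased_variance(data)

# Both agree with the standard library's two variance functions.
same_as_pvariance = math.isclose(v_obs, statistics.pvariance(data))
same_as_variance = math.isclose(v_unb, statistics.variance(data))
```

The correction factor n / (n - 1) approaches 1 as the number of observations grows, so the two variances differ mostly for small samples.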
Standard deviation
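See also: Standard deviation
$s = \sqrt{s^2}$: it is the square root of the variance.
- Coefficient of variation: $CV = s / \bar{x}$, the standard deviation divided by the arithmetic mean $\bar{x}$.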
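A short Python sketch of both quantities. The sample values are invented, and the choice of the sample standard deviation (division by n - 1) in the coefficient of variation is an assumption of this example; the coefficient is only meaningful when the mean is not zero.

```python
import math
import statistics

data = [1, 3, 5, 9, 6, 4, 6]  # illustrative values only

# Standard deviation: square root of the (sample) variance.
sd = math.sqrt(statistics.variance(data))

# Coefficient of variation: standard deviation relative to the mean.
cv = sd / statistics.mean(data)

print(f"standard deviation = {sd:.4f}, coefficient of variation = {cv:.3f}")
```

Being a ratio, the coefficient of variation has no unit, which makes it convenient for comparing the dispersion of variables measured on different scales.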
Minimum and maximum
- Range: it is the interval between the smallest and the greatest value. A phenomenon is said to present "strong dynamics" when the range (or dispersion) is large.
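A minimal Python sketch, with a function name (value_range) and sample values that are ours. It returns the two bounds together with their difference, since both the interval and its width are commonly reported.

```python
def value_range(values):
    """Return (minimum, maximum, maximum - minimum) of the observations."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest


# Illustrative values only.
lowest, highest, width = value_range([1, 3, 5, 9, 6, 4, 6])
print(f"values lie between {lowest} and {highest} (width {width})")
```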