Reinforcement learning refers to a class of machine learning problems whose goal is to learn, from experience, what to do in various situations so as to maximize a numerical reward over time.

A classic way to present reinforcement learning problems is to consider an autonomous agent, immersed in an environment, which must make decisions based on its current state. In return, the environment provides the agent with a reward, which can be positive or negative.

Through repeated experience, the agent seeks an optimal decision-making behavior (called a strategy or policy, which is a function mapping the current state to the action to be taken), optimal in the sense that it maximizes the sum of rewards over time.

History and link with biology

Among the first reinforcement learning algorithms are TD-learning, proposed by Richard Sutton in 1988, and Q-learning, developed mainly in a thesis defended in 1989 and formally published in 1992.

However, the origins of reinforcement learning are older. It derives from theoretical formalizations of optimal control methods, which aim to design a controller that minimizes, over time, a given measure of the behavior of a dynamical system. The discrete, stochastic version of this problem is called a Markov decision process and was introduced by Bellman in 1957.

In addition, the formalization of reinforcement learning problems also drew heavily on theories from animal psychology, such as those analyzing how an animal can learn by trial and error to adapt to its environment. These theories strongly inspired the field of artificial intelligence and contributed greatly to the emergence of reinforcement learning algorithms in the early 1980s.

Interestingly, the ongoing refinement of reinforcement learning algorithms in turn inspires the work of neurobiologists and psychologists seeking to understand how the brain functions and how animals behave. Collaboration between neurobiologists and artificial intelligence researchers has revealed that part of the brain operates in a way very similar to reinforcement learning algorithms such as TD-learning. It would thus seem that nature discovered, over the course of evolution, a method similar to the one found by researchers for optimizing how an agent or organism can learn by trial and error. Or rather, artificial intelligence researchers have partly rediscovered what nature took millions of years to put in place. The brain region that shows analogies with reinforcement learning algorithms is called the basal ganglia; one of its subparts, the substantia nigra, releases a neuromodulator, dopamine, which chemically strengthens synaptic connections between neurons. This functioning of the basal ganglia has been identified in all vertebrates, and the same kind of results are found in medical imaging of humans.

Finally, the loop of scientific exchange between neurobiologists, psychologists and artificial intelligence researchers is not closed: researchers currently draw inspiration from the brain to refine reinforcement learning algorithms and thereby try to develop robots that are more autonomous and adaptive than existing ones. Even if nature and researchers seem to have independently found the same solution to certain types of problems, such as those described in the preceding paragraph, the intelligence of current robots is clearly still far from that of humans, or even that of many animals such as monkeys or rodents. A promising way to narrow this gap is to analyze in more detail how the biological brain parameterizes and anatomically structures processes such as reinforcement learning, and how it integrates these processes with other cognitive functions such as perception, spatial orientation, planning and memory, in order to reproduce this integration in the artificial brain of a robot.

Formalism

Formally, the basic reinforcement learning model consists of:

1. a set of states S of the agent in the environment;
2. a set of actions A that the agent can carry out;
3. a set of scalar values R, the "rewards", that the agent can obtain.

At each time step $t$ of the algorithm, the agent perceives its state $s_t \in S$ and the set of possible actions $A(s_t)$. It chooses an action $a \in A(s_t)$ and receives from the environment a new state $s_{t+1}$ and a reward $r_{t+1}$ (which is zero most of the time and is classically equal to 1 in certain key states of the environment). Based on these interactions, the reinforcement learning algorithm must allow the agent to develop a policy $\pi : S \rightarrow A$ that enables it to maximize the amount of reward. This quantity is written $R = r_0 + r_1 + \dots + r_n$ for Markov decision processes (MDPs) that have a terminal state, or $R = \sum_t \gamma^t r_t$ for MDPs without a terminal state (where $\gamma$ is a discount factor between 0 and 1 that, depending on its value, gives more or less weight to rewards further in the future when choosing the agent's actions).
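The two reward sums can be illustrated with a short Python sketch. The function names `episodic_return` and `discounted_return` and the sample reward list are hypothetical, chosen only for illustration; the discounted sum is necessarily truncated to the rewards actually supplied.

```python
def episodic_return(rewards):
    """Undiscounted return R = r_0 + r_1 + ... + r_n (MDP with a terminal state)."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted return R = sum_t gamma^t * r_t, truncated to the given rewards."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between 0 and 1")
    total = 0.0
    # Accumulate from the last reward backwards: total = r_t + gamma * total
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: reward is zero everywhere except in the final key state
rewards = [0, 0, 0, 1]
print(episodic_return(rewards))
print(discounted_return(rewards, 0.5))
```

With a small discount factor, the reward received three steps in the future contributes only a fraction of its value; with a discount factor of 1, both sums coincide on a finite episode.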

Thus, reinforcement learning is particularly suited to problems requiring a trade-off between the pursuit of short-term rewards and that of long-term rewards. The method has been applied successfully to a variety of problems, such as robot control, the inverted pendulum, task scheduling, telecommunications, backgammon and chess.

Algorithm

A basic algorithmic version of reinforcement learning, specifically TD-learning, is given here. Its purpose is to let the interested reader quickly implement the algorithm on a computer in order to better understand its iterative operation. The simplest case is considered, in which the goal is not to improve the agent's behavior (or policy) but to evaluate a given policy as it is carried out by an agent in a given environment.

Initialize V(s) arbitrarily; this is the value that the agent assigns to each state s.
Initialize the policy π to be evaluated.
Repeat (for each episode):
    Initialize s
    Repeat (for each time step of the episode):
        a ← action given by π for s
        The agent carries out action a; observe the reward r and the next state s'
        V(s) ← V(s) + α [r + γ V(s') − V(s)]
        s ← s'
    Until s is terminal
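The evaluation loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function `td0_policy_evaluation`, the small random-walk environment (`step`, `is_terminal`, `random_policy`) and all parameter values are assumptions introduced for the example. As a further simplification, V is initialized to zero rather than arbitrarily, which keeps the value of terminal states at zero.

```python
import random


def td0_policy_evaluation(policy, step, is_terminal, states, start_state,
                          alpha=0.1, gamma=1.0, episodes=1000, seed=0):
    """Estimate V(s) for a fixed policy with the TD-learning update."""
    rng = random.Random(seed)
    # Assumption: V starts at 0 for every state (terminal values stay 0)
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = start_state                      # initialize s
        while not is_terminal(s):            # until s is terminal
            a = policy(s, rng)               # action given by pi for s
            r, s_next = step(s, a)           # observe reward and next state
            # V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


# Hypothetical environment: a random walk on states 0..6.
# States 0 and 6 are terminal; reaching state 6 yields a reward of 1.
N = 5
STATES = range(N + 2)


def is_terminal(s):
    return s == 0 or s == N + 1


def random_policy(s, rng):
    # Policy under evaluation: move left (-1) or right (+1) with equal probability
    return rng.choice((-1, 1))


def step(s, a):
    s_next = s + a
    reward = 1.0 if s_next == N + 1 else 0.0
    return reward, s_next


V = td0_policy_evaluation(random_policy, step, is_terminal, STATES,
                          start_state=3, alpha=0.02, gamma=1.0, episodes=5000)
for s in range(1, N + 1):
    print(s, round(V[s], 2))
```

For this particular walk with γ = 1, the value of state i under the random policy is the probability of ending on the right, i/6, so the printed estimates should lie close to 1/6, 2/6, and so on, up to noise caused by the constant step size α.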

For more details on implementing this algorithm, or on the algorithmic version of Q-learning, refer to the book by Sutton and Barto published in 1998, the full contents of which are freely available on the Internet (see the external link at the end of this article).
