The large circle indicates the zone the robot should enter and stay in. Once it has entered the zone, it should simply not leave it anymore. The robot is initially placed with a random orientation in a random position outside the zone, and during learning it is punished for collisions and rewarded strongly for every time step it spends in the zone (for details see below). Moreover, while outside the zone, it is rewarded for moving as quickly and as straight as possible² and for keeping away from walls. It uses four infrared proximity sensors at the front.
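A minimal sketch of this reward scheme is given below. All constants, signal names, and value ranges are illustrative assumptions, not the paper's values (the exact details are the ones referred to above as "see below"):

```python
def reward(in_zone, collided, speed_left, speed_right, proximity):
    """Illustrative per-time-step reward for the task described above.
    Assumptions: speed_left/speed_right are normalized wheel speeds in
    [-1, 1]; proximity holds the four front infrared readings in [0, 1]."""
    if collided:
        return -1.0                       # punished for collisions
    if in_zone:
        return 1.0                        # strong reward per step spent in the zone
    # Outside the zone: reward fast, straight motion that keeps away from walls.
    forward = max(0.0, (speed_left + speed_right) / 2.0)
    straightness = 1.0 - abs(speed_left - speed_right) / 2.0
    wall_clearance = 1.0 - max(proximity)
    return 0.1 * forward * straightness * wall_clearance
```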
[Figure: three network architectures, labeled A, B, and C, each with sensor inputs and motor outputs.]
² The left and the right pairs of the six front sensors are averaged and used as if they were one sensor.
Figure 5. Simulated Khepera robot in environment 1. The large circle indicates the zone the robot should enter and stay in. The small circle represents the robot, and the lines inside the robot indicate position and direction of the infrared proximity sensors used in experiments 1 and 2.
2. Network training
Recurrent networks are known to be difficult to train with gradient-descent methods such as standard backpropagation [Rumelhart, 1986] or even backpropagation through time [Werbos, 1990]. They are often sensitive to the fine details of the training algorithm, e.g., the number of time steps unrolled in the case of backpropagation through time (e.g., [Mozer, 1989]).
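To make the sensitivity to the unrolling length concrete, the following is a minimal sketch (not taken from any of the cited papers) of an Elman-style network trained with truncated backpropagation through time; the truncation length k is the detail in question, and all sizes and constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 2              # e.g., sensor inputs -> motor outputs
W_in = rng.normal(0.0, 0.1, (n_hid, n_in))
W_rec = rng.normal(0.0, 0.1, (n_hid, n_hid))
W_out = rng.normal(0.0, 0.1, (n_out, n_hid))
lr, k = 0.01, 5                           # learning rate, truncation length

def train_window(xs, ys, h):
    """One truncated-BPTT update over a window of k time steps; the
    gradient is cut off at the window boundary, so dependencies longer
    than k steps are invisible to learning."""
    hs, zs = [h], []
    for x in xs:                          # forward pass, unrolled k steps
        h = np.tanh(W_in @ x + W_rec @ hs[-1])
        hs.append(h)
        zs.append(W_out @ h)
    dW_in, dW_rec, dW_out = (np.zeros_like(W) for W in (W_in, W_rec, W_out))
    dh_next = np.zeros(n_hid)
    for t in reversed(range(len(xs))):    # backward pass through the window
        dz = zs[t] - ys[t]                # squared-error gradient at the output
        dW_out += np.outer(dz, hs[t + 1])
        dpre = (W_out.T @ dz + dh_next) * (1.0 - hs[t + 1] ** 2)
        dW_in += np.outer(dpre, xs[t])
        dW_rec += np.outer(dpre, hs[t])
        dh_next = W_rec.T @ dpre          # gradient stops after k steps back
    for W, dW in ((W_in, dW_in), (W_rec, dW_rec), (W_out, dW_out)):
        W -= lr * dW                      # in-place weight update
    return hs[-1]                         # carry hidden state across windows

# Train on an arbitrary sequence, window by window.
h = np.zeros(n_hid)
xs = rng.normal(size=(100, n_in))
ys = rng.normal(size=(100, n_out))
for start in range(0, len(xs), k):
    h = train_window(xs[start:start + k], ys[start:start + k], h)
```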
For example, in an autonomous agent context, Rylatt [1998] showed, for one particular task, that with some enhancements Simple Recurrent Networks [Elman, 1990] could be trained to handle long-term dependencies in a continuous domain, thus contradicting the results of Ulbricht [1996], who had argued the opposite.
In an extension of the work discussed in the previous section, Meeden [1996] experimentally compared the training of recurrent control networks with (a) a local search method, a version of backpropagation adapted for reinforcement learning, and (b) a global method, an evolutionary algorithm. The results showed that the evolutionary algorithm in several cases found strategies that the local method did not find. In particular, when only delayed reinforcement was available to the learning robot, the evolutionary method performed significantly better because it did not rely on moment-to-moment guidance at all [Meeden, 1996].
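As a sketch of the global alternative in that comparison, the loop below evolves the flattened weights of a small recurrent control network. Note that it consumes only one scalar fitness per trial, which is why such methods do not depend on moment-to-moment guidance. The network sizes, selection scheme, and stand-in fitness function are all assumptions, not Meeden's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 4, 8, 2
n_weights = n_hid * n_in + n_hid * n_hid + n_out * n_hid

def unpack(genome):
    """Split a flat genome into the recurrent network's weight matrices."""
    a = n_hid * n_in
    b = a + n_hid * n_hid
    return (genome[:a].reshape(n_hid, n_in),
            genome[a:b].reshape(n_hid, n_hid),
            genome[b:].reshape(n_out, n_hid))

def fitness(genome, steps=200):
    """Run one trial and return a single scalar score. A real setup would
    step the simulated robot and accumulate the (possibly delayed)
    reinforcement; this stand-in just scores smooth motor outputs."""
    W_in, W_rec, W_out = unpack(genome)
    h, score = np.zeros(n_hid), 0.0
    for _ in range(steps):
        x = rng.normal(size=n_in)               # stand-in for sensor readings
        h = np.tanh(W_in @ x + W_rec @ h)
        motors = np.tanh(W_out @ h)
        score -= float(np.sum(motors ** 2))     # stand-in for accumulated reward
    return score

mu, lam, sigma = 5, 20, 0.1                     # parents, offspring, mutation step
pop = [rng.normal(0.0, 0.5, n_weights) for _ in range(mu + lam)]
for generation in range(50):
    ranked = sorted(pop, key=fitness, reverse=True)
    parents = ranked[:mu]                       # truncation selection
    offspring = [parents[i % mu] + rng.normal(0.0, sigma, n_weights)
                 for i in range(lam)]
    pop = parents + offspring                   # elitism: parents survive unchanged
```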