The large circle indicates the zone the robot should enter and stay in. Once it has entered the zone, it should simply not leave it anymore. The robot is initially placed with a random orientation in a random position outside the zone, and during learning it is punished for collisions and rewarded strongly for every time step it spends in the zone (for details see below). Moreover, while outside the zone, it is rewarded for moving as quickly and as straight as possible² and for keeping away from walls. It uses four infrared proximity sensors at the front.
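A minimal sketch of this reward scheme is given below. All constants, signal names, and value ranges are illustrative assumptions, not the paper's values (the exact details are the ones referred to above as "see below"):

```python
def reward(in_zone, collided, speed_left, speed_right, proximity):
    """Illustrative per-time-step reward for the task described above.
    Assumptions: speed_left/speed_right are normalized wheel speeds in
    [-1, 1]; proximity holds the four front infrared readings in [0, 1]."""
    if collided:
        return -1.0                       # punished for collisions
    if in_zone:
        return 1.0                        # strong reward per step spent in the zone
    # Outside the zone: reward fast, straight motion that keeps away from walls.
    forward = max(0.0, (speed_left + speed_right) / 2.0)
    straightness = 1.0 - abs(speed_left - speed_right) / 2.0
    wall_clearance = 1.0 - max(proximity)
    return 0.1 * forward * straightness * wall_clearance
```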
[Figure: three network architectures, labeled A, B, and C, each with sensor inputs and motor outputs.]
² The left and the right pairs of the six front sensors are averaged and used as if they were one sensor.
Figure 5. Simulated Khepera robot in environment 1. The large circle indicates the zone the robot should enter and stay in. The small circle represents the robot, and the lines inside the robot indicate position and direction of the infrared proximity sensors used in experiments 1 and 2.
2. Network training
Recurrent networks are known to be difficult to train with gradient-descent methods such as standard backpropagation [Rumelhart, 1986] or even backpropagation through time [Werbos, 1990]. They are often sensitive to the fine details of the training algorithm, e.g., the number of time steps unrolled in the case of backpropagation through time (e.g., [Mozer, 1989]).
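To make the sensitivity to the unrolling length concrete, the following is a minimal sketch (not taken from any of the cited papers) of an Elman-style network trained with truncated backpropagation through time; the truncation length k is the detail in question, and all sizes and constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 2              # e.g., sensor inputs -> motor outputs
W_in = rng.normal(0.0, 0.1, (n_hid, n_in))
W_rec = rng.normal(0.0, 0.1, (n_hid, n_hid))
W_out = rng.normal(0.0, 0.1, (n_out, n_hid))
lr, k = 0.01, 5                           # learning rate, truncation length

def train_window(xs, ys, h):
    """One truncated-BPTT update over a window of k time steps; the
    gradient is cut off at the window boundary, so dependencies longer
    than k steps are invisible to learning."""
    hs, zs = [h], []
    for x in xs:                          # forward pass, unrolled k steps
        h = np.tanh(W_in @ x + W_rec @ hs[-1])
        hs.append(h)
        zs.append(W_out @ h)
    dW_in, dW_rec, dW_out = (np.zeros_like(W) for W in (W_in, W_rec, W_out))
    dh_next = np.zeros(n_hid)
    for t in reversed(range(len(xs))):    # backward pass through the window
        dz = zs[t] - ys[t]                # squared-error gradient at the output
        dW_out += np.outer(dz, hs[t + 1])
        dpre = (W_out.T @ dz + dh_next) * (1.0 - hs[t + 1] ** 2)
        dW_in += np.outer(dpre, xs[t])
        dW_rec += np.outer(dpre, hs[t])
        dh_next = W_rec.T @ dpre          # gradient stops after k steps back
    for W, dW in ((W_in, dW_in), (W_rec, dW_rec), (W_out, dW_out)):
        W -= lr * dW                      # in-place weight update
    return hs[-1]                         # carry hidden state across windows

# Train on an arbitrary sequence, window by window.
h = np.zeros(n_hid)
xs = rng.normal(size=(100, n_in))
ys = rng.normal(size=(100, n_out))
for start in range(0, len(xs), k):
    h = train_window(xs[start:start + k], ys[start:start + k], h)
```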
For example, in an autonomous agent context, Rylatt [1998] showed, for one particular task, that with some enhancements Simple Recurrent Networks [Elman, 1990] could be trained to handle long-term dependencies in a continuous domain, thus contradicting the results of Ulbricht [1996], who had argued the opposite.
In an extension of the work discussed in the previous section, Meeden [1996] experimentally compared the training of recurrent control networks with (a) a local search method, a version of backpropagation adapted for reinforcement learning, and (b) a global method, an evolutionary algorithm. The results showed that the evolutionary algorithm in several cases found strategies that the local method did not find. In particular, when only delayed reinforcement was available to the learning robot, the evolutionary method performed significantly better because it did not rely on moment-to-moment guidance at all [Meeden, 1996].
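As a sketch of the global alternative in that comparison, the loop below evolves the flattened weights of a small recurrent control network. Note that it consumes only one scalar fitness per trial, which is why such methods do not depend on moment-to-moment guidance. The network sizes, selection scheme, and stand-in fitness function are all assumptions, not Meeden's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 4, 8, 2
n_weights = n_hid * n_in + n_hid * n_hid + n_out * n_hid

def unpack(genome):
    """Split a flat genome into the recurrent network's weight matrices."""
    a = n_hid * n_in
    b = a + n_hid * n_hid
    return (genome[:a].reshape(n_hid, n_in),
            genome[a:b].reshape(n_hid, n_hid),
            genome[b:].reshape(n_out, n_hid))

def fitness(genome, steps=200):
    """Run one trial and return a single scalar score. A real setup would
    step the simulated robot and accumulate the (possibly delayed)
    reinforcement; this stand-in just scores smooth motor outputs."""
    W_in, W_rec, W_out = unpack(genome)
    h, score = np.zeros(n_hid), 0.0
    for _ in range(steps):
        x = rng.normal(size=n_in)               # stand-in for sensor readings
        h = np.tanh(W_in @ x + W_rec @ h)
        motors = np.tanh(W_out @ h)
        score -= float(np.sum(motors ** 2))     # stand-in for accumulated reward
    return score

mu, lam, sigma = 5, 20, 0.1                     # parents, offspring, mutation step
pop = [rng.normal(0.0, 0.5, n_weights) for _ in range(mu + lam)]
for generation in range(50):
    ranked = sorted(pop, key=fitness, reverse=True)
    parents = ranked[:mu]                       # truncation selection
    offspring = [parents[i % mu] + rng.normal(0.0, sigma, n_weights)
                 for i in range(lam)]
    pop = parents + offspring                   # elitism: parents survive unchanged
```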