We begin by looking more closely at some simple algorithms for estimating the value of

actions and for using the estimates to make action-selection decisions. In this chapter, we

denote the true (actual) value of action $a$ as $q(a)$ and its estimated value at time step $t$ as $Q_t(a)$.

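One natural estimate averages the rewards actually received: if, prior to time step $t$, action $a$ has been chosen $N_t(a)$

times, yielding rewards $R_1, R_2, \ldots, R_{N_t(a)}$, then its value is estimated to be

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}.$$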

This is only one way to estimate action

values, not necessarily the best one. Nevertheless, for now let us stay with this simple

estimation algorithm and turn to the question of how the estimates might be used to select actions.

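The simplest rule is to act greedily, always selecting an action whose estimated value is highest. This rule exploits current

knowledge to maximize immediate reward; it spends no time at all sampling apparently

inferior actions to see if they might really be better. A simple alternative is to behave

greedily most of the time but, with small probability $\varepsilon$, to select an action uniformly at random instead. In the limit, every action is then sampled infinitely often, so the value estimates converge to the true values and the

probability of selecting the optimal action converges to greater than $1 - \varepsilon$, i.e., to near certainty.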

These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.

testbed.
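
To make the practical question concrete, the following is a minimal Python sketch, not a definitive implementation, of averaging the rewards received for each action and choosing between exploiting and exploring. It reads the "simple alternative" as the rule usually called epsilon-greedy: act greedily except that, with small probability `epsilon`, an action is picked uniformly at random. The function names (`sample_average`, `select_action`, `run_bandit`), the three hypothetical true action values, the Gaussian unit-variance reward noise, and the default estimate of 0 for untried actions are all assumptions made for illustration.

```python
import random


def sample_average(total, count, default=0.0):
    """Average of the rewards seen for one action; `default` if never tried."""
    return total / count if count else default


def select_action(estimates, epsilon, rng):
    """Explore uniformly with probability epsilon; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    best = max(estimates)
    # Break ties among equally good greedy actions at random.
    return rng.choice([a for a, q in enumerate(estimates) if q == best])


def run_bandit(true_values, epsilon, steps, seed=0):
    """Simulate one run and report estimates, counts, and optimal-action rate.

    Assumption: each reward is the action's true value plus Gaussian noise
    with unit standard deviation.
    """
    rng = random.Random(seed)
    k = len(true_values)
    totals = [0.0] * k   # sum of rewards received per action
    counts = [0] * k     # number of times each action was chosen
    optimal = max(range(k), key=lambda a: true_values[a])
    optimal_picks = 0
    for _ in range(steps):
        estimates = [sample_average(totals[a], counts[a]) for a in range(k)]
        action = select_action(estimates, epsilon, rng)
        reward = rng.gauss(true_values[action], 1.0)
        totals[action] += reward
        counts[action] += 1
        optimal_picks += action == optimal
    estimates = [sample_average(totals[a], counts[a]) for a in range(k)]
    return estimates, counts, optimal_picks / steps


if __name__ == "__main__":
    # Hypothetical true values; compare pure greedy with a little exploration.
    for eps in (0.0, 0.1):
        _, counts, rate = run_bandit([0.0, 0.5, 5.0], eps, steps=5000)
        print(f"epsilon={eps}: counts={counts}, optimal rate={rate:.3f}")
```

With `epsilon=0.0` the sketch is purely greedy, so an action that looks good early can be chosen indefinitely while the others are never tried. With a small positive `epsilon` every action keeps being sampled, which is what the asymptotic argument relies on. How the two compare over a finite run depends on the assumed reward distribution and the number of steps.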