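We begin by looking more closely at some simple algorithms for estimating the value of
actions and for using the estimates to make action-selection decisions. In this chapter, we
denote the true (actual) value of action $a$ as $q(a)$ and its estimated value at time step
$t$ as $Q_t(a)$. If, by time step $t$, action $a$ has been chosen $N_t(a)$
times, yielding rewards $R_1, R_2, \ldots, R_{N_t(a)}$, then its value is estimated to be

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}.$$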
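As a rough illustration of this averaging estimate, the following Python sketch keeps a running reward total and a selection count per action and reports their ratio. The class name `SampleAverageEstimator`, its method names, and the default value of 0.0 for an action that has never been tried are all assumptions made for this example; the text does not prescribe them.

```python
class SampleAverageEstimator:
    """Estimates each action's value as the mean of the rewards it has yielded."""

    def __init__(self, default=0.0):
        # Assumption: an action that has never been selected gets this estimate.
        self.default = default
        self.reward_sums = {}   # action -> sum of rewards received so far
        self.counts = {}        # action -> number of times the action was chosen

    def update(self, action, reward):
        """Record one reward obtained after selecting `action`."""
        self.reward_sums[action] = self.reward_sums.get(action, 0.0) + reward
        self.counts[action] = self.counts.get(action, 0) + 1

    def estimate(self, action):
        """Return the average reward observed for `action`."""
        n = self.counts.get(action, 0)
        if n == 0:
            return self.default
        return self.reward_sums[action] / n


# Usage: an action chosen three times, yielding rewards 1, 0, and 2.
est = SampleAverageEstimator()
for r in (1.0, 0.0, 2.0):
    est.update("left", r)
print(est.estimate("left"))  # 1.0
```

Storing only a sum and a count per action keeps memory constant regardless of how many rewards have been seen.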
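This is only one way to estimate action values, not necessarily the best one. Nevertheless,
for now let us stay with this simple
estimation algorithm and turn to the question of how the estimates might be used to select
actions. The simplest rule is to select the action with the highest estimated value. This
greedy method always exploits current
knowledge to maximize immediate reward; it spends no time at all sampling apparently
inferior actions to see if they might really be better. A simple alternative is to behave
greedily most of the time but, with small probability $\varepsilon$, to select an action at
random instead. Because such an $\varepsilon$-greedy method samples every action infinitely
often in the limit, the estimates converge to the true values and the
probability of selecting the optimal action converges to greater than $1 - \varepsilon$, i.e., to near certainty.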
These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.
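To make the two selection rules concrete, here is a minimal Python sketch of greedy selection and of the alternative that occasionally picks an action at random. The function names, the representation of estimates as a list indexed by action, the random tie-breaking, and the `rng` parameter are assumptions of this example rather than anything specified in the text.

```python
import random


def greedy_action(estimates, rng=random):
    """Return the index of a highest-valued action, breaking ties at random."""
    best = max(estimates)
    candidates = [a for a, q in enumerate(estimates) if q == best]
    return rng.choice(candidates)


def epsilon_greedy_action(estimates, epsilon, rng=random):
    """With probability epsilon pick uniformly at random; otherwise act greedily."""
    if rng.random() < epsilon:
        # Exploratory step: every action, including the greedy one, is eligible.
        return rng.randrange(len(estimates))
    return greedy_action(estimates, rng)


# Usage: three actions whose current estimates favor action 1.
rng = random.Random(0)            # seeded only to make the run repeatable
estimates = [0.1, 0.5, 0.2]
picks = [epsilon_greedy_action(estimates, 0.1, rng) for _ in range(10000)]
print(picks.count(1) / len(picks))
```

With three actions and an exploration probability of 0.1, the greedy action is chosen with probability 0.9 + 0.1/3, so the printed fraction should sit a little above 0.9. That matches the asymptotic statement above only once the estimates themselves are accurate, which this sketch simply assumes.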
testbed.