RLbook2020trimmed
Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state
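To make "the total amount of reward an agent can expect to accumulate" concrete, here is a minimal sketch that estimates a state's value by averaging the total reward of many sampled episodes. The corridor environment, the function names, and the episode count are all assumptions made for illustration; none of them come from the book.

```python
import random


def corridor_step(state, rng):
    """Hypothetical toy environment (an assumption, not from the book).

    States 0..3 lie in a row. Each step costs -1 reward, moves one state to
    the right with probability 0.5, and the episode ends on reaching state 3.
    """
    if rng.random() < 0.5:
        state += 1
    return state, -1.0, state == 3


def sample_return(start_state, step, rng, max_steps=1000):
    """Total (undiscounted) reward collected in one episode from start_state."""
    state, total = start_state, 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, rng)
        total += reward
        if done:
            break
    return total


def estimate_value(start_state, step, episodes=5000, seed=0):
    """Average the sampled returns: a simple estimate of the state's value."""
    rng = random.Random(seed)
    returns = [sample_return(start_state, step, rng) for _ in range(episodes)]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    for s in (0, 1, 2):
        print(s, round(estimate_value(s, corridor_step), 2))
```

In this toy setting each move to the right takes two steps on average, so the estimates should come out near -6, -4, and -2 for states 0, 1, and 2: every step pays the same immediate reward, yet states closer to the goal have higher value.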
In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last six decades
For example, solution methods such as genetic algorithms, genetic programming, simulated annealing, and other optimization methods never estimate value functions. These methods apply multiple static policies each interacting over an extended period of time with a separate instance of the environment. The policies that obtain the most reward, and random variations of them, are carried over to the next generation of policies, and the process repeats
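The generational loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the five-state problem, the hidden `TARGET` policy, and the `evaluate`, `mutate`, and `evolve` helpers are hypothetical, and `evaluate` stands in for running a fixed policy in its own instance of an environment. Note that no value function is estimated anywhere; each policy is scored only by the total reward it obtains.

```python
import random

# Hypothetical problem: one binary action per state; TARGET is the (hidden)
# best action in each of five states. This is an assumption for illustration.
TARGET = (1, 0, 1, 1, 0)


def evaluate(policy):
    """Stand-in for one long run of a static policy: returns its total reward."""
    return sum(1.0 for action, best in zip(policy, TARGET) if action == best)


def mutate(policy, rng, rate=0.2):
    """Random variation: flip each action independently with probability rate."""
    return tuple(1 - a if rng.random() < rate else a for a in policy)


def evolve(generations=30, population_size=20, keep=5, seed=0):
    """Keep the highest-reward policies, refill with mutated copies, repeat."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in TARGET) for _ in range(population_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[:keep]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - keep)
        ]
        population = survivors + children
    return max(population, key=evaluate)


if __name__ == "__main__":
    best = evolve()
    print(best, evaluate(best))
```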
If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, or if a lot of time is available for the search, then evolutionary methods can be effective
The concepts of value and value function are key to most of the reinforcement learning methods that we consider in this book. We take the position that value functions are important for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by evaluations of entire policies
If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value
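A small sketch of the exploit/explore distinction, assuming the action-value estimates are kept in a plain list indexed by action. The selection rule is a variant of epsilon-greedy, a technique this excerpt does not name: with probability `epsilon` it picks a nongreedy action, otherwise a greedy one. The function names and the `epsilon` parameter are assumptions for illustration.

```python
import random


def greedy_actions(estimates):
    """All actions whose estimated value is greatest (there is at least one)."""
    best = max(estimates)
    return [a for a, q in enumerate(estimates) if q == best]


def select_action(estimates, epsilon, rng):
    """Return (action, explored).

    With probability epsilon, explore by choosing a nongreedy action at
    random; otherwise exploit by choosing one of the greedy actions.
    """
    greedy = greedy_actions(estimates)
    nongreedy = [a for a in range(len(estimates)) if a not in greedy]
    if nongreedy and rng.random() < epsilon:
        return rng.choice(nongreedy), True
    return rng.choice(greedy), False


if __name__ == "__main__":
    rng = random.Random(0)
    estimates = [1.0, 3.0, 3.0, 0.5]  # actions 1 and 2 are tied as greedy
    print(greedy_actions(estimates))
    print(select_action(estimates, 0.1, rng))
```

Exploring only pays off if the agent then updates its estimate for the action it tried; that update step is left out of this sketch.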
Particularly relevant is the key feature of reinforcement learning that it takes long-term consequences of decisions into account
The problem of ensuring that a reinforcement learning agent’s goal is attuned to our own remains a challenge
How do you make sure that an agent gets enough experience to learn a high-performing policy, all the while not harming its environment, other agents, or itself (or more realistically, while keeping the probability of harm acceptably low)?
Tags: #ai
One of the most pressing areas for future reinforcement learning research is to adapt and extend methods developed in control engineering with the goal of making it acceptably safe to fully embed reinforcement learning agents into physical environments