RLbook2020trimmed

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run

Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state
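A minimal sketch, assuming an episodic task: the value of a state can be approximated by running many episodes from that state and averaging the total reward each one collects (a Monte Carlo estimate). With `gamma=1.0` the return is exactly the "total amount of reward" described above. The `discounted_return` helper, the optional discount `gamma`, and the `toy_episode` environment are hypothetical illustrations, not definitions taken from the highlighted text.

```python
import random
from typing import Callable, List


def discounted_return(rewards: List[float], gamma: float = 1.0) -> float:
    """Total reward of one episode; gamma < 1 discounts later rewards (assumption)."""
    g = 0.0
    # Accumulate backwards so each reward is discounted by its distance from the start.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_state_value(
    run_episode: Callable[[int], List[float]],
    start_state: int,
    num_episodes: int = 1000,
    gamma: float = 1.0,
) -> float:
    """Average the returns of many episodes that begin in start_state."""
    total = 0.0
    for _ in range(num_episodes):
        total += discounted_return(run_episode(start_state), gamma)
    return total / num_episodes


# Hypothetical toy environment: an episode started in state s lasts s steps,
# and each step pays a reward of 1 with probability 0.5, otherwise 0.
def toy_episode(state: int) -> List[float]:
    return [1.0 if random.random() < 0.5 else 0.0 for _ in range(state)]


random.seed(0)
# The true expected value of state 4 is 4 * 0.5 = 2.0; the estimate should land near it.
print(estimate_state_value(toy_episode, start_state=4))
```

More episodes shrink the estimate's variance; the agent never needs to know the reward probabilities, only to sample experience.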


In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last six decades
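One common way to estimate values efficiently, which the same book introduces for action values, is the incremental rule NewEstimate ← OldEstimate + StepSize × (Target − OldEstimate). It keeps only the current estimate and a count, instead of storing every past reward. The class below is a hypothetical sketch of that rule, not code from the book: a step size of 1/n reproduces the exact sample average, while a constant step size (an assumption you can pass in) weights recent targets more heavily.

```python
from typing import Optional


class IncrementalMean:
    """Running value estimate: estimate += step_size * (target - estimate)."""

    def __init__(self, initial: float = 0.0, step_size: Optional[float] = None):
        self.estimate = initial
        self.count = 0
        # None means "use 1/n", which gives the exact average of all targets seen.
        self.step_size = step_size

    def update(self, target: float) -> float:
        self.count += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.count
        # Move the estimate a fraction alpha of the way toward the new target.
        self.estimate += alpha * (target - self.estimate)
        return self.estimate


avg = IncrementalMean()
for reward in [2.0, 4.0, 6.0]:
    avg.update(reward)
print(avg.estimate)  # 4.0, the sample average, computed in O(1) memory
```

The constant-step-size variant is the usual choice when the quantity being estimated drifts over time, since old targets are gradually forgotten.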

For example, solution methods such as genetic algorithms, genetic programming, simulated annealing, and other optimization methods never estimate value functions. These methods apply multiple static policies each interacting over an extended period of time with a separate instance of the environment. The policies that obtain the most reward, and random variations of them, are carried over to the next generation of policies, and the process repeats
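To make the contrast concrete, here is a small sketch of an evolutionary search of the kind described above, under illustrative assumptions: a hypothetical corridor environment, policies represented as fixed lookup tables of actions, and a reward of 1 whenever the agent reaches a cell farther right than before. Each static policy runs in its own fresh episode, only its total reward is recorded, and the best policies plus mutated copies form the next generation. No value function is estimated anywhere. All names and parameters (`N_CELLS`, `episode_reward`, `mutate`, `evolve`, population sizes, mutation rate) are made up for illustration.

```python
import random
from typing import List

N_CELLS = 8      # hypothetical corridor: cells 0..7, the episode ends at cell 7
MAX_STEPS = 20   # cap on episode length


def episode_reward(policy: List[int]) -> float:
    """Run one episode in a fresh corridor; policy[cell] is 0 (left) or 1 (right)."""
    cell, farthest, total = 0, 0, 0.0
    for _ in range(MAX_STEPS):
        cell = cell + 1 if policy[cell] == 1 else max(0, cell - 1)
        # Reward 1 for reaching a cell farther right than any reached so far.
        if cell > farthest:
            farthest = cell
            total += 1.0
        if cell == N_CELLS - 1:
            break
    return total


def mutate(policy: List[int], rate: float = 0.1) -> List[int]:
    """Random variation: flip each action independently with probability `rate`."""
    return [1 - a if random.random() < rate else a for a in policy]


def evolve(pop_size: int = 20, n_parents: int = 5, generations: int = 50) -> List[int]:
    # Start from random static policies.
    population = [[random.randint(0, 1) for _ in range(N_CELLS)] for _ in range(pop_size)]
    for _ in range(generations):
        # Each policy is judged only by the total reward of its whole episode.
        population.sort(key=episode_reward, reverse=True)
        parents = population[:n_parents]
        # Carry the best policies over and fill the rest with mutated copies of them.
        children = [mutate(random.choice(parents)) for _ in range(pop_size - n_parents)]
        population = parents + children
    return max(population, key=episode_reward)


random.seed(0)
best = evolve()
print(best, episode_reward(best))  # a reward of 7.0 means the policy reaches the end
```

Notice that a whole episode yields a single number per policy; the search never learns which individual states or actions were responsible for the reward.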

If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find (or if a lot of time is available for the search), then evolutionary methods can be effective

The concepts of value and value function are key to most of the reinforcement learning methods that we consider in this book. We take the position that value functions are important for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by evaluations of entire policies

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value
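One simple way to balance the two, used throughout the same book, is ε-greedy action selection: exploit a greedy action most of the time, but with a small probability ε pick an action at random so that nongreedy estimates keep improving. The sketch below assumes action-value estimates are stored in a plain list indexed by action; the function names and the default `epsilon=0.1` are illustrative choices.

```python
import random
from typing import List


def greedy_actions(q: List[float]) -> List[int]:
    """All actions whose estimated value is greatest (there is always at least one)."""
    best = max(q)
    return [a for a, v in enumerate(q) if v == best]


def epsilon_greedy(q: List[float], epsilon: float = 0.1) -> int:
    """Explore with probability epsilon, otherwise exploit a greedy action."""
    if random.random() < epsilon:
        # Exploring: choose uniformly among all actions (this may still hit a greedy one).
        return random.randrange(len(q))
    # Exploiting: break ties among the greedy actions at random.
    return random.choice(greedy_actions(q))


random.seed(1)
q = [0.2, 0.5, 0.1]
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
# Expected share of the greedy action: 0.9 + 0.1 / 3, about 0.93.
print(picks.count(1) / len(picks))
```

Setting `epsilon=0` gives a purely greedy agent that only exploits; larger values trade immediate reward for better estimates of the nongreedy actions.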

Particularly relevant is the key feature of reinforcement learning that it takes long-term consequences of decisions into account

The problem of ensuring that a reinforcement learning agent’s goal is attuned to our own remains a challenge

How do you make sure that an agent gets enough experience to learn a high-performing policy, all the while not harming its environment, other agents, or itself (or more realistically, while keeping the probability of harm acceptably low)?

Tags: #ai

One of the most pressing areas for future reinforcement learning research is to adapt and extend methods developed in control engineering with the goal of making it acceptably safe to fully embed reinforcement learning agents into physical environments