#ai #computing

# Model free

**Policy Optimization**. Methods in this family represent a policy explicitly as $\pi_{\theta}(a|s)$. They optimize the parameters $\theta$ either directly, by gradient ascent on the performance objective $J(\pi_{\theta})$, or indirectly, by maximizing a local approximation of $J(\pi_{\theta})$. This optimization is almost always performed [[On-Policy]]; a minimal sketch of the direct update appears at the end of this note.

- [[Soft Actor Critic]] - performs gradient ascent to directly maximize performance
- [[Proximal Policy Optimization]] - updates indirectly maximize performance by instead maximizing a surrogate objective function, which gives a conservative estimate of how much $J(\pi_{\theta})$ will change as a result of the update

**Q-Learning**. Methods in this family learn an approximator $Q_{\theta}(s,a)$ for the optimal action-value function $Q^*(s,a)$. Typically they use an objective function based on the [[Bellman Equation]]. This optimization is almost always performed [[Off-Policy]].

- [[Deep Q Network]] - launched the field of deep RL
- C51 - learns a distribution over returns whose expectation is $Q^*$

[Equivalence Between Policy Gradients and Soft Q-Learning](https://arxiv.org/abs/1704.06440)

# Model based

**TODO: filter out**

There are many orthogonal ways of using models. In each case, the model may either be given or learned.

**Pure Planning**. The most basic approach never explicitly represents the policy and instead uses pure planning techniques like [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control) (MPC) to select actions. The [MBMF](https://sites.google.com/view/mbmf) work explores MPC with learned environment models on some standard benchmark tasks for deep RL. A random-shooting sketch of MPC appears at the end of this note.

**Expert Iteration**. A straightforward follow-on to pure planning involves using and learning an explicit representation of the policy, $\pi_{\theta}(a|s)$. The agent uses a planning algorithm (like Monte Carlo Tree Search) in the model, generating candidate actions for the plan by sampling from its current policy. The planning algorithm produces an action which is better than what the policy alone would have produced, hence it is an “expert” relative to the policy. The policy is afterwards updated to produce an action more like the planning algorithm’s output. The ExIt algorithm uses this approach to train deep neural networks to play Hex. AlphaZero is another example of this approach.

**Data Augmentation for Model-Free Methods**. Use a model-free RL algorithm to train a policy or Q-function, but either 1) augment real experiences with fictitious ones when updating the agent, or 2) use only fictitious experience for updating the agent. See MBVE for an example of augmenting real experiences with fictitious ones. See World Models for an example of using purely fictitious experience to train the agent, which they call “training in the dream.”

**Embedding Planning Loops into Policies**. Another approach embeds the planning procedure directly into a policy as a subroutine, so that complete plans become side information for the policy, while training the output of the policy with any standard model-free algorithm. The key concept is that in this framework the policy can learn to choose how and when to use the plans. This makes model bias less of a problem, because if the model is bad for planning in some states, the policy can simply learn to ignore it. See I2A for an example of agents being endowed with this style of imagination.
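
The following is a minimal sketch of the direct policy-gradient update described under **Policy Optimization**: gradient ascent on $J(\pi_{\theta})$ via the score-function (REINFORCE) estimator. The network size, discrete action space, and hyperparameters are illustrative assumptions, not something specified in this note.

```python
# Minimal REINFORCE-style policy-gradient step (sketch). Assumes a discrete
# action space; obs_dim, n_actions, and the network are illustrative.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def policy_gradient_step(obs, actions, returns):
    """One gradient-ascent step on J(pi_theta) via the score-function estimator.

    obs:     (N, obs_dim) float tensor of visited states
    actions: (N,) long tensor of actions sampled from the current policy (on-policy)
    returns: (N,) float tensor of reward-to-go from each state
    """
    logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)
    loss = -(logp * returns).mean()  # ascend J by descending its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the log-probabilities must come from actions taken under the current policy, this update is [[On-Policy]].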
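
A similarly minimal sketch of the Bellman-equation objective used by the **Q-Learning** family, in DQN style. The target network, batch layout, and discount factor are assumptions for illustration.

```python
# DQN-style Bellman-error objective (sketch). q_net and q_target are assumed
# torch modules mapping a batch of states to per-action Q-values.
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, q_target, batch, gamma=0.99):
    """Mean-squared Bellman error on a batch of (possibly off-policy) transitions.

    batch: dict of tensors s (N, obs_dim), a (N,) long, r (N,), s2 (N, obs_dim), done (N,)
    """
    q_sa = q_net(batch["s"]).gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman backup: r + gamma * max_a' Q_target(s', a'), zeroed at terminals
        max_next = q_target(batch["s2"]).max(dim=1).values
        target = batch["r"] + gamma * (1.0 - batch["done"]) * max_next
    return F.mse_loss(q_sa, target)
```

Since the transitions can come from any behavior policy stored in a replay buffer, this objective can be optimized [[Off-Policy]].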
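
For **Pure Planning**, a random-shooting MPC sketch: sample candidate action sequences, roll each through a learned model, and execute only the first action of the best sequence before replanning. The `model(s, a)` and `reward_fn(s, a)` interfaces, the action bounds, and the horizon are all assumptions, not details from MBMF.

```python
# Random-shooting MPC (sketch, NumPy). `model(s, a) -> s_next` is a learned
# one-step dynamics model and `reward_fn(s, a)` a known or learned reward
# function; both interfaces are assumptions for this illustration.
import numpy as np

def mpc_action(state, model, reward_fn, action_dim, horizon=15, n_candidates=500, rng=None):
    """Return the first action of the highest-return sampled action sequence."""
    if rng is None:
        rng = np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))  # assumes actions in [-1, 1]
        s, total = state, 0.0
        for a in seq:
            total += reward_fn(s, a)
            s = model(s, a)
        if total > best_return:
            best_return, best_first_action = total, seq[0]
    return best_first_action  # execute this, then replan from the next observed state
```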
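
Finally, a sketch of the policy-update step in **Expert Iteration**: the policy is pushed toward the planner's improved action distribution (e.g. normalized MCTS visit counts, in the spirit of ExIt/AlphaZero). The `planner_probs` targets and the policy interface are assumed for illustration.

```python
# Expert-iteration distillation step (sketch): move the policy toward the
# planner's improved action distribution, e.g. normalized MCTS visit counts.
import torch

def distill_step(policy, optimizer, obs, planner_probs):
    """Cross-entropy between planner targets and the policy's action distribution.

    obs:           (N, obs_dim) float tensor of states the planner was run from
    planner_probs: (N, n_actions) rows summing to 1 (assumed visit-count targets)
    """
    logp = torch.log_softmax(policy(obs), dim=1)
    loss = -(planner_probs * logp).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```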