Description
Policy gradients are the foundation of modern deep reinforcement learning (RL) algorithms. In general, the objective of the RL algorithms can be expressed as
$$\theta\gets\underset{\theta\in\Theta}{\arg\max}\mathbb{E}_{\tau\sim{p_{\theta}^{\pi}(\tau})}\left[\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right]=\underset{\theta\in\Theta}{\arg\max}\mathbb{E}_{\tau\sim{p_{\theta}^{\pi}(\tau})}\left[r(\tau,\gamma)\right],$$
where $\tau=[s_{0},a_{0},\ldots,s_{T},a_{T}]$ represents a single trajectory and $p_{\theta}^{\pi}(\tau)$ is the distribution of such trajectories by executing a policy $\pi(a_{t}\mid{s_{t}};\theta)$. Thereby, the policy gradients with respect to policy parameters $\theta$ are then given by
$$\begin{aligned}\nabla_{\theta}\mathcal{J}(\theta)
&=\int_{\tau}\nabla_{\theta}p_{\theta}^{\pi}(\tau)\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\mathrm{d}\tau\\
&=\int_{\tau}p_{\theta}^{\pi}(\tau)\nabla_{\tau}\log{p^{\pi}_{\theta}(\tau)}\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\mathrm{d}\tau\\
&=\mathbb{E}_{\tau\sim{p_{\theta}^{\pi}(\tau})}
\left[\left(\sum\limits_{t=0}^{|\tau|}\nabla_{\tau}\log\pi(a_{t}\mid{s_{t};\theta)}\right)
\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\right].\end{aligned}$$
To-do List
Description
Policy gradients are the foundation of modern deep reinforcement learning (RL) algorithms. In general, the objective of the RL algorithms can be expressed as
where$\tau=[s_{0},a_{0},\ldots,s_{T},a_{T}]$ represents a single trajectory and $p_{\theta}^{\pi}(\tau)$ is the distribution of such trajectories by executing a policy $\pi(a_{t}\mid{s_{t}};\theta)$ . Thereby, the policy gradients with respect to policy parameters $\theta$ are then given by
To-do List