Policy Gradients Reinforcement Learning #28

@mrbyflyg

Description

Policy gradients are the foundation of many modern deep reinforcement learning (RL) algorithms. In general, the objective of an RL algorithm can be expressed as

$$\theta\gets\underset{\theta\in\Theta}{\arg\max}\;\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)}\left[\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right]=\underset{\theta\in\Theta}{\arg\max}\;\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)}\left[r(\tau,\gamma)\right],$$

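where $\tau=[s_{0},a_{0},\ldots,s_{T},a_{T}]$ represents a single trajectory (so $|\tau|=T$ indexes its final step) and $p_{\theta}^{\pi}(\tau)$ is the distribution over such trajectories induced by executing the policy $\pi(a_{t}\mid s_{t};\theta)$. Writing $\mathcal{J}(\theta)=\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)}\left[r(\tau,\gamma)\right]$ for this expected discounted return, the policy gradient with respect to the policy parameters $\theta$ is given by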

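$$\begin{aligned}\nabla_{\theta}\mathcal{J}(\theta) &=\int_{\tau}\nabla_{\theta}p_{\theta}^{\pi}(\tau)\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\mathrm{d}\tau\\ &=\int_{\tau}p_{\theta}^{\pi}(\tau)\nabla_{\theta}\log p_{\theta}^{\pi}(\tau)\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\mathrm{d}\tau\\ &=\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)} \left[\left(\sum\limits_{t=0}^{|\tau|}\nabla_{\theta}\log\pi(a_{t}\mid s_{t};\theta)\right) \cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\right].\end{aligned}$$

The second line uses the identity $\nabla_{\theta}p_{\theta}^{\pi}(\tau)=p_{\theta}^{\pi}(\tau)\nabla_{\theta}\log p_{\theta}^{\pi}(\tau)$. The third line follows because the trajectory distribution factorizes into the initial-state distribution, the policy terms $\pi(a_{t}\mid s_{t};\theta)$, and the environment transition probabilities; only the policy terms depend on $\theta$, so the gradients of all other log-factors vanish.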
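The final expectation can be approximated by averaging over sampled trajectories. The following is a minimal NumPy sketch of that Monte Carlo estimate, not a reference implementation: it assumes a tabular softmax policy with logits `theta[state, action]`, and the helper names (`discounted_return`, `grad_log_softmax_policy`, `policy_gradient_estimate`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Compute r(tau, gamma) = sum_t gamma^t * r(s_t, a_t)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def grad_log_softmax_policy(theta, state, action):
    """Gradient of log pi(action | state; theta) w.r.t. theta.

    Assumption: pi is a tabular softmax policy, theta has shape
    (n_states, n_actions), and pi(. | s) = softmax(theta[s]).
    """
    grad = np.zeros_like(theta, dtype=float)
    grad[state] = -softmax(theta[state])  # -pi(b | s) for every action b
    grad[state, action] += 1.0            # +1 for the action actually taken
    return grad


def policy_gradient_estimate(theta, trajectories, gamma):
    """Average of (sum_t grad log pi(a_t | s_t)) * r(tau, gamma).

    `trajectories` is a list of (states, actions, rewards) tuples that are
    assumed to have been sampled by running the current policy.
    """
    total = np.zeros_like(theta, dtype=float)
    for states, actions, rewards in trajectories:
        score = np.zeros_like(theta, dtype=float)
        for s, a in zip(states, actions):
            score += grad_log_softmax_policy(theta, s, a)
        total += score * discounted_return(rewards, gamma)
    return total / len(trajectories)


# Toy usage: one state, two actions, a single one-step trajectory in which
# action 1 was taken and received reward 1.
theta = np.zeros((1, 2))
trajectories = [([0], [1], [1.0])]
estimate = policy_gradient_estimate(theta, trajectories, gamma=0.9)
print(estimate)
```

In this toy case the estimate is positive for the logit of the rewarded action and negative for the other one, so a gradient-ascent step `theta += learning_rate * estimate` would make the rewarded action more likely. In practice the estimate has high variance, which is why many trajectories are usually averaged.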

To-do List

Metadata

Labels

enhacements: New features or enhancements to existing ones.
