Policy Gradients Reinforcement Learning

## Description

Policy gradients underlie many modern deep reinforcement learning (RL) algorithms. In general, the objective of an RL algorithm can be expressed as

$$\theta\gets\underset{\theta\in\Theta}{\arg\max}\,\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)}\left[\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right]=\underset{\theta\in\Theta}{\arg\max}\,\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)}\left[r(\tau,\gamma)\right],$$

where $\tau=\[s_{0},a_{0},\ldots,s_{T},a_{T}\]$ represents a single trajectory and $p_{\theta}^{\pi}(\tau)$ is the distribution over such trajectories induced by executing the policy $\pi(a_{t}\mid{s_{t}};\theta)$. Writing $\mathcal{J}(\theta)$ for the expected discounted return above, the policy gradient with respect to the policy parameters $\theta$ is given by

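$$\begin{aligned}\nabla_{\theta}\mathcal{J}(\theta)
&=\int_{\tau}\nabla_{\theta}p_{\theta}^{\pi}(\tau)\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\mathrm{d}\tau\\
&=\int_{\tau}p_{\theta}^{\pi}(\tau)\nabla_{\theta}\log{p^{\pi}_{\theta}(\tau)}\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\mathrm{d}\tau\\
&=\mathbb{E}_{\tau\sim p_{\theta}^{\pi}(\tau)}
\left[\left(\sum\limits_{t=0}^{|\tau|}\nabla_{\theta}\log\pi(a_{t}\mid{s_{t}};\theta)\right)
\cdot\left(\sum\limits_{t=0}^{|\tau|}\gamma^{t}r(s_{t},a_{t})\right)\right].\end{aligned}$$

The second line uses the identity $\nabla_{\theta}p_{\theta}^{\pi}(\tau)=p_{\theta}^{\pi}(\tau)\nabla_{\theta}\log{p_{\theta}^{\pi}(\tau)}$. The third line follows because the initial-state distribution and the environment transition probabilities that make up $p_{\theta}^{\pi}(\tau)$ do not depend on $\theta$, so their gradients vanish and only the policy terms remain.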
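The expectation in the last line can be approximated by averaging over sampled trajectories. The following is a minimal sketch of that Monte Carlo estimate, not a reference implementation. It assumes a tabular softmax policy in which `theta[s][a]` is the logit of action `a` in state `s`. The function names (`discounted_return`, `softmax`, `reinforce_gradient`) and the trajectory layout as a list of `(state, action, reward)` tuples are hypothetical choices made for illustration.

```python
import math


def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_gradient(trajectories, theta, gamma):
    """Monte Carlo estimate of the policy gradient.

    trajectories: list of trajectories, each a list of (state, action, reward).
    theta: theta[s][a] is the logit of action a in state s
           (assumption: tabular softmax policy).
    """
    n = len(trajectories)
    grad = [[0.0] * len(row) for row in theta]
    for traj in trajectories:
        ret = discounted_return([r for _, _, r in traj], gamma)
        # Accumulate sum_t grad_theta log pi(a_t | s_t; theta) for this trajectory.
        score = [[0.0] * len(row) for row in theta]
        for s, a, _ in traj:
            probs = softmax(theta[s])
            for b, p in enumerate(probs):
                # d/d theta[s][b] of log softmax(theta[s])[a] is 1[a == b] - p_b.
                score[s][b] += (1.0 if b == a else 0.0) - p
        # Weight the summed score by the trajectory's discounted return.
        for s, row in enumerate(score):
            for b, g in enumerate(row):
                grad[s][b] += g * ret / n
    return grad


# One state, two actions, uniform policy, a single one-step trajectory.
theta = [[0.0, 0.0]]
trajectories = [[(0, 1, 1.0)]]
print(reinforce_gradient(trajectories, theta, gamma=0.9))  # [[-0.5, 0.5]]
```

In this toy case the sampled action 1 received a positive return, so the estimate raises its logit and lowers the logit of action 0. The estimate is unbiased but can have high variance, which is one motivation for the baseline and advantage-based methods in the to-do list.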

## To-do List
- [ ] REINFORCE algorithm (_a.k.a., the [Vanilla Policy Gradient](https://link.springer.com/article/10.1007/BF00992696)_)
    - [ ] [Generalized Advantage Estimation (GAE)](https://arxiv.org/abs/1506.02438) 
- [ ] [Actor-Critic (AC)](https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html) algorithm
- [ ] [Trust-Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477) algorithm
- [ ] [Soft Actor-Critic (SAC)](https://arxiv.org/abs/1801.01290) algorithm
