Code for Double Gumbel Q-Learning
[.pdf] [Reviews] [Poster (.png)] [5-min talk] [1-hour seminar] [Errata]
Data (5.4 MB): https://drive.google.com/file/d/12wyYZ92bvVdkEQIHms8mVR5zYJZue-cd/view?usp=sharing
Logs (4.21 GB): https://drive.google.com/file/d/1LpR3lrKUx-qTaCrI4YViAjc0QA5kb8P2/view?usp=sharing
Due to an accident in saving data, the logs are incomplete and do not contain data for Figs. 2 and 7.
Tested on Python 3.9 with CUDA 12.2.1 and cuDNN 8.8.0.
git clone git@github.com:dyth/doublegum.git
cd doublegum
create virtualenv
virtualenv <VIRTUALENV_LOCATION>/doublegum
source <VIRTUALENV_LOCATION>/doublegum/bin/activate
or conda
conda create --name doublegum python=3.9
conda activate doublegum
install mujoco
mkdir .mujoco
cd .mujoco
wget https://mujoco.org/download/mujoco210-linux-x86_64.tar.gz
tar -xf mujoco210-linux-x86_64.tar.gz
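If the Python bindings in use are mujoco-py (an assumption; commonly the case with mujoco210), they typically expect the archive extracted to ~/.mujoco/mujoco210 and the MuJoCo binaries on the dynamic library path, for example:
# assumes the mkdir/cd/wget/tar steps above were run from your home directory
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin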
install packages
pip install -r requirements.txt
pip install "jax[cuda12_pip]==0.4.14" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
test that the code runs
./test.sh
run continuous control
python main_cont.py --env <ENV_NAME> --policy <POLICY>
MetaWorld environments are run with --env MetaWorld_<ENV_NAME>
Policies benchmarked in our paper were:
- DoubleGum: DoubleGum (our algorithm)
- DDPG: DDPG (Deep Deterministic Policy Gradient) [Lillicrap et al., 2015]
- TD3: TD3 (Twin Delayed DDPG) [Fujimoto et al., 2018]
- SAC: SAC (Soft Actor-Critic, defaults to using Twin Critics) [Haarnoja et al., 2018]
- XQL --ensemble 1: XQL (Extreme Q-Learning) [Garg et al., 2023]
- MoG-DDPG: MoG-DDPG (Mixture-of-Gaussians-Critics DDPG) [Barth-Maron et al., 2018; Shahriari et al., 2022]
Policies we created/modified as additional benchmarks were:
- QR-DDPG: QR-DDPG (Quantile Regression [Dabney et al., 2018] with DDPG, defaults to using Twin Critics)
- QR-DDPG --ensemble 1: QR-DDPG without Twin Critics
- SAC --ensemble 1: SAC without Twin Critics
- XQL: XQL with Twin Critics
- TD3 --ensemble 5 --pessimism <p>: Finer TD3, where <p> is an integer between 0 and 4
Policies included in this repository but not benchmarked in our paper were:
- IQL: Implicit Q-Learning adapted to an online setting [Kostrikov et al., 2022]
- SACLite: SAC without the entropy term on the critic [Yu et al., 2022]
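For example (the environment names below are illustrative guesses, not taken from the repository; substitute any supported <ENV_NAME>):
python main_cont.py --env HalfCheetah-v4 --policy DoubleGum
python main_cont.py --env MetaWorld_button-press-v2 --policy SAC --ensemble 1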
run discrete control
python main_disc.py --env <ENV_NAME> --policy <POLICY>
Policies benchmarked in our paper were:
- DoubleGum: DoubleGum (our algorithm)
- DQN: DQN [Mnih et al., 2015]
- DDQN: DDQN (Double DQN) [van Hasselt et al., 2016]
- DuellingDQN: DuellingDQN [Wang et al., 2016]
Policies we created/modified as additional benchmarks were:
- DuellingDDQN: DuellingDDQN (Duelling Double DQN)
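For example (the environment name below is an illustrative guess; substitute any supported discrete-control <ENV_NAME>):
python main_disc.py --env CartPole-v1 --policy DoubleGum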
Figures and tables are reproduced from the raw data in Data and Logs (linked above).
Logs (4.21 GB) contains data for Figs. 1 and 6, while Data (5.4 MB) contains the benchmark results for DoubleGum and the baselines used in all other figures and tables.
As noted above, the logs are incomplete and do not contain data for Figs. 2 and 7.
Run with
python plotting/fig<x>.py
python tables/tab<x>.py
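For example, assuming the fig<x>.py naming above, Fig. 1 (whose data is included in Logs) should be reproducible with:
python plotting/fig1.py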
- Wrappers from ikostrikov/jaxrl
- Distributional RL from google-deepmind/acme
- Control flow from yifan12wu/td3-jax