This repository implements the Deep Deterministic Policy Gradient (DDPG) algorithm using TensorFlow 2.x and OpenAI Gym. DDPG is an off-policy, model-free, actor-critic algorithm designed for continuous action spaces. This implementation trains an agent to control a pendulum using reinforcement learning.
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards for performing actions that lead to desirable states and aims to maximize cumulative rewards over time.
DDPG is an Actor-Critic method that extends the Deterministic Policy Gradient (DPG) algorithm. It uses:
- Actor Network - Determines the best action to take given a state (policy function).
- Critic Network - Estimates the Q-value (expected cumulative reward) of a state-action pair.
- Replay Buffer - Stores past experiences and allows sampling for training, breaking correlation between updates.
- Target Networks - Copies of the Actor and Critic networks that update slowly to stabilize training.
- Ornstein-Uhlenbeck Noise - Encourages exploration by adding time-correlated noise to actions.
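To make the last component concrete, an Ornstein-Uhlenbeck process can be sketched in a few lines of NumPy. This is an illustrative implementation, not the repository's code, and the parameter values (`theta=0.15`, `sigma=0.2`) are common defaults rather than the ones this project necessarily uses:

```python
import numpy as np

class OUNoise:
    """Time-correlated exploration noise (Ornstein-Uhlenbeck process)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu)  # current noise state

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

noise = OUNoise(action_dim=1)
samples = [noise.sample().copy() for _ in range(1000)]
```

Because each sample drifts from the previous one instead of being drawn independently, the resulting actions change smoothly, which suits physical control tasks like the pendulum.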
The Actor network is trained to maximize the expected Q-value; its parameters are updated via the deterministic policy gradient:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[\, \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a = \mu(s)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \,\right]$$

Where:
- $\mu(s \mid \theta^{\mu})$ is the policy function (Actor network) with parameters $\theta^{\mu}$.
- $Q(s, a \mid \theta^{Q})$ is the Critic network estimating the expected return of taking action $a$ in state $s$.
The Critic network updates its weights by minimizing the mean-squared Bellman error:

$$L(\theta^{Q}) = \mathbb{E}\left[\, \left( r + \gamma\, Q'(s', \mu'(s' \mid \theta^{\mu'}) \mid \theta^{Q'}) - Q(s, a \mid \theta^{Q}) \right)^{2} \,\right]$$

Where:
- $r$ is the immediate reward.
- $\gamma$ is the discount factor.
- $Q'(s', \mu'(s'))$ is the target Q-value, computed with the target Actor and Critic networks.
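As a concrete illustration of the Bellman target, the following NumPy sketch computes it for a toy mini-batch (illustrative numbers only, not the repository's code; the `done` mask is a general RL detail, although the classic Pendulum task has no terminal states):

```python
import numpy as np

gamma = 0.99  # discount factor

# Toy mini-batch: rewards and target-network Q-values at the next states.
r = np.array([1.0, 0.5, -0.2])
q_next = np.array([10.0, 8.0, 6.0])  # Q'(s', mu'(s')) from the target networks
done = np.array([0.0, 0.0, 1.0])     # no bootstrapping past a terminal state

# Bellman target: y = r + gamma * Q'(s', mu'(s')), zeroed at terminal states
y = r + gamma * (1.0 - done) * q_next

# Current Critic estimates Q(s, a) for the sampled (s, a) pairs
q_pred = np.array([9.0, 7.5, 0.1])

# Mean-squared Bellman error minimized by the Critic
loss = np.mean((y - q_pred) ** 2)
```

The Critic's gradient step pushes `q_pred` toward `y`; because `y` is built from the slow-moving target networks, the regression target stays relatively stable between updates.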
- Observe the current state $s$.
- Select an action $a = \mu(s) + \text{noise}$ (the noise encourages exploration).
- Execute the action and observe reward $r$ and next state $s'$.
- Store $(s, a, r, s')$ in the Replay Buffer.
- Sample a mini-batch from the buffer.
- Train the Critic (Q-function update using the Bellman equation).
- Train the Actor (policy update using the gradient of the Q-function).
- Update the Target Networks via soft updates:

$$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1 - \tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1 - \tau)\,\theta^{\mu'}$$

Where $\tau \ll 1$ is the soft update parameter.
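A minimal NumPy sketch of the soft update rule (in the actual TensorFlow code the same rule is applied to each weight tensor of the Actor and Critic; the `TAU` value below is a typical choice, not necessarily this project's):

```python
import numpy as np

TAU = 0.01  # soft update rate; small values mean slow-moving targets

def soft_update(target_weights, online_weights, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', per weight array."""
    return [tau * w + (1.0 - tau) * tw
            for tw, w in zip(target_weights, online_weights)]

online = [np.ones((2, 2)), np.zeros(3)]
target = [np.zeros((2, 2)), np.ones(3)]
target = soft_update(target, online)
# target[0] is now 0.01 everywhere, target[1] is 0.99 everywhere
```

Because the target parameters move only a fraction `tau` toward the online parameters each step, the regression targets used by the Critic change slowly, which is what stabilizes training.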
The agent learns to swing up and balance a pendulum (Gym's `Pendulum` environment). The observation includes:
- Cosine and sine of the pendulum angle
- Angular velocity

The action is the torque applied to the pendulum. The goal is to keep the pendulum upright while minimizing the applied torque (energy usage).
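This trade-off is visible in the environment's reward. The sketch below reproduces the reward used by the standard Gym Pendulum implementation (coefficients as in Gym's source; treat it as illustrative rather than a substitute for the environment):

```python
import numpy as np

def angle_normalize(theta):
    """Wrap an angle to [-pi, pi)."""
    return ((theta + np.pi) % (2 * np.pi)) - np.pi

def pendulum_reward(theta, theta_dot, torque):
    # Penalizes deviation from upright (theta = 0), spinning, and torque.
    # Upright, motionless, and torque-free gives the maximum reward of 0.
    return -(angle_normalize(theta) ** 2
             + 0.1 * theta_dot ** 2
             + 0.001 * torque ** 2)

best = pendulum_reward(0.0, 0.0, 0.0)          # upright and still: 0.0
worst_angle = pendulum_reward(np.pi, 0.0, 0.0)  # hanging straight down
```

Since every reward is non-positive, an episode total near 0 is ideal; in practice a well-trained agent settles around -200 per episode because it must first spend steps swinging the pendulum up.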
Make sure you have Python 3.8+ installed. Then, install the required dependencies:
```bash
pip install tensorflow gym numpy matplotlib
```

To train the DDPG agent, run:

```bash
python DDPG_update2.py
```

Expected training time:
- CPU: ~5-10 minutes for 300 episodes.
- GPU (optional): faster, but ensure the GPU-enabled TensorFlow build is installed.
The model automatically saves weights every 20 episodes to avoid loss of progress:
```python
if episode % 20 == 0:
    ddpg.actor.save_weights("actor_weights.h5")
    ddpg.critic.save_weights("critic_weights.h5")
```

To resume training, weights are loaded if available:
```python
try:
    ddpg.actor.load_weights("actor_weights.h5")
    ddpg.critic.load_weights("critic_weights.h5")
    print("Model loaded! Continuing training...")
except OSError:  # no saved weights found (or the files are unreadable)
    print("No saved model found. Starting fresh.")
```

After training, a reward plot is displayed to monitor performance:
```python
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.title("DDPG Training Progress")
plt.show()
```

✅ Rewards should gradually increase over episodes.
✅ A well-trained agent balances the pendulum at around -200 episode reward (close to optimal).
✅ The training curve should stabilize after 200+ episodes.
Lower learning rates give more stable updates:

```python
LR_A = 0.0003  # more stable updates for the Actor
LR_C = 0.003   # faster Q-value updates for the Critic
```

Reducing the exploration noise slowly gives better control over time:

```python
if episode % 10 == 0:
    ddpg.noise.std_dev *= 0.998  # slower decay
```

Training for more episodes produces better policies:

```python
MAX_EPISODES = 300  # more training for better policies
```

Scaling down rewards prevents unstable learning:
```python
ddpg.memory.store(state, action, reward / 10, next_state)
```

If training is too slow:
- Run with fewer episodes first (`MAX_EPISODES = 50`).
- Check whether TensorFlow is running on CPU only (`os.environ["CUDA_VISIBLE_DEVICES"] = "-1"` forces CPU).
- Reduce the batch size (`BATCH_SIZE = 64`).

If rewards are not improving:
- Increase the number of training episodes (`MAX_EPISODES = 500`).
- Ensure the learning rates are low enough for stability.
- Check whether the noise is decaying too quickly (`std_dev *= 0.998`).

If resuming from saved weights fails:
- Ensure the `actor_weights.h5` and `critic_weights.h5` files exist.
- Start training without loading weights (skip the `try`/`except` block).
If you find bugs or want to improve the implementation, feel free to submit a pull request.
This project is open-source and available under the MIT License.